
Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18

Started by: "Harlan Stenn via questions Mailing List" <questions@lists.ntp.org>
First post: 2025-07-01 03:48 +0000
Last post: 2025-07-04 17:08 +0000
Articles: 8 (4 participants)


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18 "Harlan Stenn via questions Mailing List" <questions@lists.ntp.org> - 2025-07-01 03:48 +0000
    Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18 Miroslav Lichvar <mlichvar@redhat.com> - 2025-07-01 11:00 +0000
      RE: [EXT] Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18 "Windl, Ulrich" <u.windl@ukr.de> - 2025-07-02 10:23 +0000
        Re: [EXT] Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18 "Miroslav Lichvar via questions Mailing List" <questions@lists.ntp.org> - 2025-07-02 14:58 +0000
          RE: [EXT] Re: Re: Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18 "Windl, Ulrich" <u.windl@ukr.de> - 2025-07-07 10:58 +0000
      Re: [EXT] Re: Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18 "Miroslav Lichvar via questions Mailing List" <questions@lists.ntp.org> - 2025-07-07 09:38 +0000
    Re: [EXT] Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18 "Miroslav Lichvar via questions Mailing List" <questions@lists.ntp.org> - 2025-07-02 10:43 +0000
      RE: [EXT] Re: Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18 "Windl, Ulrich" <u.windl@ukr.de> - 2025-07-04 17:08 +0000

#164191 — Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18

From: "Harlan Stenn via questions Mailing List" <questions@lists.ntp.org>
Date: 2025-07-01 03:48 +0000
Subject: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18
Message-ID: <90763509-b155-4fdb-8605-b861d8bd20b7@ntp.org>

As I've said before, just because the behavior is different does not 
mean it's broken.

The NTP algorithms do ongoing evaluations of established associations.

If an association becomes non-responsive, it auto-degrades.

At some point the association will drop.

Dave Mills was, to the best of my recollection, very hesitant to throw 
out an established association "too soon".

Let's get more information and understanding around what you're seeing.

H

On 6/30/2025 12:00 PM, Dave Hart wrote:
> 
> On Tue, May 27, 2025 at 12:13 UTC MOUHOUNE Samir
> <samir.mouhoune@gmail.com> wrote:
> 
>     Dear NTP Community,
> 
>     We have observed a potentially unexpected behavior with |ntpd|
>     version *4.2.8p18* concerning the delay in transitioning to stratum
>     16 when a local reference clock (tsync) loses synchronization.
> 
> 
>           Issue Summary
> 
>     When our local reference clock (tsync) becomes unsynchronized, we
>     expect |ntpd| to stop selecting it and switch the system to stratum
>     16 relatively quickly, indicating the system is no longer a valid
>     time source.
> 
>     However, on systems running *ntpd 4.2.8p18*, this transition appears
>     *delayed by up to one hour*. During this time, |ntpd| continues to
>     treat tsync as a valid source and *reports stratum 1*, even though
>     synchronization is no longer valid.
> 
>     In comparison, this behavior does *not occur* in older versions like
>     *4.2.8p15*, where the system transitions to stratum 16 within a few
>     minutes.
> 
>     We suspect this may be due to internal changes in source selection
>     and trust logic introduced in later versions, possibly making |ntpd|
>     more conservative about declassifying known sources — even when they
>     become unreliable.
> 
> There have been no changes to ntpd/refclock_tsyncpci.c since 4.2.8p5 in 
> 2016, so the issue might well affect other refclocks.  It sounds like 
> something we need to fix, or at a minimum understand and justify as an 
> improvement.
> 
> 
>           Temporary Workaround
> 
>     We experimented with the following configuration adjustments, which
>     appear to mitigate the issue by making |ntpd| more responsive:
>
>     tos orphanwait 1
>     tos mindist 0.05
>     tinker stepout 10
>     tinker panic 0
>     minpoll 3
>     maxpoll 4
> 
>     These parameters seem to accelerate response to changes in sync status.
> 
> 
> That's a lot of different knobs turned.  Did you make all 6 changes at 
> once and observe improvement, or one at a time, or ?
> I'm glad you found something to help out, and those might help point to 
> code change(s) responsible, but given so many knobs changed, not as 
> helpful as I might hope.
> 
> 
>           Questions
> 
>      1.
> 
>         Is this delay in downgrading to stratum 16 in 4.2.8p18 an
>         *intended behavior*, or is it considered a *regression* compared
>         to earlier versions?
> 
> It's hard to see why it would be intended, but we're far from 
> understanding the issue well enough to be definitive.
> 
>      1. Are there *recommended configuration settings* or best practices
>         to ensure timely transition to stratum 16 when a local reference
>         becomes unreliable?
> 
> Interesting renumbering of your questions is happening in the GMail web 
> editor using Chrome on Windows.
> I think it's fair to say we're pretty weak on documented best practices
> or recommended configuration settings, but try https://doc.ntp.org/.
> You could also look at the archives of this list
> and its onetime evil twin newsgroup comp.protocols.time.ntp.
> 
>      1.
> 
>         Would it be appropriate to submit this as a *bug report*?
> 
> Yes, please, by all means.  That's generally true if you think you've
> found a misbehavior, regression, suboptimal behavior, or just have a
> request to improve.  The only reports we don't welcome at
> https://bugs.ntp.org/ are reports of a security
> nature, such as a remote crash of ntpd based on a port 123 query, or
> nontrivial information disclosure or elevation of privileges, things
> that might merit a CVE.  In that case, please submit the report to
> security@ntp.org to ensure the information is
> not made public before remediation can be done.
> 
> I apologize for taking so long to respond.  I've had a lot going on in 
> my non-NTP life and I choose to have a relative firehose of email.  
> Thanks to Jakob for bubbling this up to my attention again.  Bug reports 
> can be ignored too, but much less likely than email.
> 
> Cheers,
> Dave Hart
> 

-- 
Harlan Stenn <stenn@ntp.org>
NTP Project Lead.  The NTP Project is part of
https://www.nwtime.org/ - be a member!



#164192

From: Miroslav Lichvar <mlichvar@redhat.com>
Date: 2025-07-01 11:00 +0000
Message-ID: <1040f45$2qa5k$1@dont-email.me>
In reply to: #164191

On 2025-07-01, Harlan Stenn via questions Mailing List
<questions@lists.ntp.org> wrote:
> The NTP algorithms do ongoing evaluations of established associations.
>
> If an association becomes non-responsive, it auto-degrades.
>
> At some point the association will drop.
>
> Dave Mills was, to the best of my recollection, very hesitant to throw 
> out an established association "too soon".

Yes. The behavior reported in this thread for 4.2.8p15 and the behavior
reported for 4.2.8p18 both sound wrong to me. NTPv4 servers are not
supposed to claim they are unsynchronized (switch to stratum 16) when
they lose a working association, whether it is a refclock or an NTP
server/peer. Quoting from RFC 5905 section 10:

   It is important to note that, unlike NTPv3, NTPv4 associations do not
   show a timeout condition by setting the stratum to 16 and leap
   indicator to 3.  The association variables retain the values
   determined upon arrival of the last packet.  In NTPv4, lambda
   increases with time, so eventually the synchronization distance
   exceeds the distance threshold MAXDIST, in which case the association
   is considered unfit for synchronization.
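
To give a feel for the time scale in the quoted passage, here is a minimal sketch (not ntpd's code) of how long a silent association could stay below the threshold. It assumes the RFC 5905 constants PHI = 15 ppm and MAXDIST = 1 s, a distance that grows only by PHI after the last packet, and a hypothetical starting distance of 10 ms; ntpd's configured threshold may differ, and the function name is made up for illustration.

```python
# Sketch only: estimates when a silent association becomes "unfit",
# assuming the distance grows at PHI after the last packet arrived.
PHI = 15e-6      # s/s, frequency tolerance from RFC 5905
MAXDIST = 1.0    # s, distance threshold from RFC 5905 (assumed here)


def seconds_until_unfit(distance_at_last_packet: float,
                        maxdist: float = MAXDIST,
                        phi: float = PHI) -> float:
    """Seconds after the last packet until the distance reaches maxdist."""
    remaining = maxdist - distance_at_last_packet
    return max(0.0, remaining / phi)


# Hypothetical association whose distance was 10 ms at the last packet.
hours = seconds_until_unfit(0.010) / 3600.0
print(f"{hours:.1f} h")   # roughly 18.3 h
```

Under these assumptions the association is not considered unfit for many hours, which is much longer than either of the delays reported in this thread.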

It seems this changed between 4.2.8p14 and 4.2.8p15 as a result of
fixing this bug:
https://bugs.ntp.org/show_bug.cgi?id=3644

The problem reported in that bug doesn't look like a bug to me.
I think it was working as intended in NTPv4. The current behavior is
a regression towards NTPv3.

-- 
Miroslav Lichvar



#164193 — RE: [EXT] Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18

From: "Windl, Ulrich" <u.windl@ukr.de>
Date: 2025-07-02 10:23 +0000
Subject: RE: [EXT] Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18
Message-ID: <e0bf2cdc994241049f4036a014892cd8@ukr.de>
In reply to: #164192

Miroslav,

From the RFC citations found in the bug report, it seems to be specified differently, or I misunderstood.

Kind regards,
Ulrich Windl

> -----Original Message-----
> From: Miroslav Lichvar <mlichvar@redhat.com>
> Sent: Wednesday, July 2, 2025 11:42 AM
> To: Windl, Ulrich <u.windl@ukr.de>
> Cc: Dave Hart <davehart@gmail.com>; questions@lists.ntp.org; Jürgen
> Perlinger <juergen.perlinger@t-online.de>; Jürgen Perlinger
> <perlinger@ntp.org>; Windl, Ulrich <windl@ntp.org>
> Subject: [EXT] Re: Re: Delay in Switching to Stratum 16 After Local Reference
> Loss on ntpd 4.2.8p18
> 
> Security notice: This e-mail was sent by a person outside
> UKR. Beware of forged senders when you click
> links, open attachments, or take further action before
> you have verified its authenticity.
> 
> On Wed, Jul 02, 2025 at 09:23:12AM +0000, Windl, Ulrich wrote:
> > Actually, I had completely forgotten about that issue. Reading it again, it
> > seems stratum should be 16 if all sources are unreachable (lost).
> 
> Why should it do that?
> 
> The idea in NTPv4 is that the decision if a source is acceptable
> should be made on the client side. If a server loses all time sources,
> its root dispersion will grow (15 ppm by default). If a client of that
> server has other sources, it can reselect when the distance becomes
> larger than that of the other sources.
> 
> If the server quickly switches to the unsynchronized state (as recent
> ntpd versions seem to be doing), the client can no longer synchronize
> to it, even if it has no other sources available. If there are
> multiple clients of that server, their clocks will not stay in sync,
> each will be drifting on its own.
> 
> --
> Miroslav Lichvar



#164195 — Re: [EXT] Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18

From: "Miroslav Lichvar via questions Mailing List" <questions@lists.ntp.org>
Date: 2025-07-02 14:58 +0000
Subject: Re: [EXT] Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18
Message-ID: <aGVIVbdlZAqr0YxA@localhost>
In reply to: #164193

On Wed, Jul 02, 2025 at 09:50:37AM +0000, Windl, Ulrich wrote:
> Miroslav,
> 
> from the RFC citations found in the bug report it seems to be specified differently, or I misunderstood.

If you are referring to Figure 24 of RFC 5905, which has "return
(UNSYNC)" for this path, nothing there suggests that this should be
handled by resetting the clock to an unsynchronized state.

The clock_select() function in A.5.5.1 doesn't have a return value.
When no usable sources are present, it doesn't do anything; it just
returns.

        /*
         * There must be at least NSANE survivors to satisfy the
         * correctness assertions.  Ordinarily, the Byzantine criteria
         * require four survivors, but for the demonstration here, one
         * is acceptable.
         */
        if (s.n < NSANE)
                return;
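
The point about the early return can be illustrated with a small sketch. This is not the RFC's or ntpd's code: the `SystemState` and `Source` types and the function name are hypothetical, and the real clock_select() does far more (intersection, clustering). It only shows that when too few survivors remain, the previously established stratum and leap values are left as they were.

```python
from dataclasses import dataclass
from typing import List

NSANE = 1  # minimum survivors, as in the RFC 5905 demonstration code


@dataclass
class Source:            # hypothetical, for illustration only
    name: str
    stratum: int


@dataclass
class SystemState:       # hypothetical, for illustration only
    stratum: int
    leap: int


def clock_select_sketch(state: SystemState, survivors: List[Source]) -> None:
    """Update the system stratum only if enough survivors remain."""
    if len(survivors) < NSANE:
        return  # nothing is reset; stratum and leap keep their values
    best = min(survivors, key=lambda s: s.stratum)
    state.stratum = best.stratum + 1


state = SystemState(stratum=2, leap=0)
clock_select_sketch(state, [])      # all sources lost
print(state)                        # still stratum 2, leap 0
```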

-- 
Miroslav Lichvar



#164198 — RE: [EXT] Re: Re: Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18

From: "Windl, Ulrich" <u.windl@ukr.de>
Date: 2025-07-07 10:58 +0000
Subject: RE: [EXT] Re: Re: Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18
Message-ID: <500e9a4e554446419f11e3f87560bead@ukr.de>
In reply to: #164195

Hi!

Well, things seem a bit more complex:
The kernel clock has a clock state, too: initially after boot it is unsynchronized, but when NTP updates the clock, it is set to synchronized, together with a few more parameters.
When the maximum error (which the kernel advances automatically) reaches a threshold, the clock is set to unsynchronized again. So even if NTP crashes, the kernel clock may indicate a synchronized status for some time.
In contrast, when NTP considers itself UNSYNC, will it set the kernel clock to unsynchronized immediately, or will it just stop sending updates to the kernel clock?
As I understood the discussion so far, the latter is the case.
What's not quite clear at the moment is whether NTP ever reads back the values provided by the kernel clock.

Traditionally NTP was assumed to know everything about the kernel clock, but maybe today the kernel clock knows its properties better than a generic NTP will, right?
So I think the interface between the NTP clock model and the kernel clock should be explained a bit better in the upcoming specification.
As I understand it, the NTP kernel clock model is still optional.
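
As a rough illustration of "synchronized for some time", the sketch below estimates how long a kernel clock would keep reporting a synchronized status after updates stop. Both constants are assumptions modeled on what I believe the Linux kernel does (maximum error growing by 500 microseconds per second, unsynchronized once it reaches 16 s); other kernels may differ, so please verify them for your platform. The function name is hypothetical.

```python
# Assumed values, believed to match Linux; verify for your kernel.
MAXERROR_GROWTH_US_PER_S = 500     # growth of the maximum error per second
UNSYNC_LIMIT_US = 16_000_000       # maximum error at which the clock goes unsynchronized


def seconds_until_kernel_unsync(maxerror_us: int) -> float:
    """Seconds until the maximum error reaches the assumed limit."""
    remaining = max(0, UNSYNC_LIMIT_US - maxerror_us)
    return remaining / MAXERROR_GROWTH_US_PER_S


# Hypothetical case: the last update left a maximum error of 50 ms.
print(seconds_until_kernel_unsync(50_000) / 3600.0)   # a little under 9 hours
```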

Kind regards,
Ulrich Windl

> -----Original Message-----
> From: Miroslav Lichvar <mlichvar@redhat.com>
> Sent: Monday, July 7, 2025 11:34 AM
> To: Windl, Ulrich <u.windl@ukr.de>
> Cc: Dave Hart <davehart@gmail.com>; questions@lists.ntp.org; Jürgen
> Perlinger <juergen.perlinger@t-online.de>; Jürgen Perlinger
> <perlinger@ntp.org>; Windl, Ulrich <windl@ntp.org>
> Subject: [EXT] Re: Re: Re: Re: Delay in Switching to Stratum 16 After Local
> Reference Loss on ntpd 4.2.8p18
> 
> On Fri, Jul 04, 2025 at 06:54:13AM +0000, Windl, Ulrich wrote:
> > Well,
> >
> > We could start a discussion what "UNSYNC" really means:
> > Does it mean the clock is free-running (not updated by the clock discipline),
> > or does it mean the clock's estimated offset is "just terrible" (like 16
> > seconds)?
> 
> I think in the context of the clock_select() function it means there
> is no source selected and the clock cannot be updated. The selection
> itself doesn't change the status of the clock. If it was previously
> considered to be synchronized, it will still be synchronized.
> 
> > With the former definitions it's likely that an issue is discovered earlier by
> > monitoring IMHO.
> 
> The monitoring can check the reachability directly and discover the
> issue even sooner, no need to wait for the orphan timeout to activate
> after the source becomes unreachable.
> 
> --
> Miroslav Lichvar



#164197 — Re: [EXT] Re: Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18

From: "Miroslav Lichvar via questions Mailing List" <questions@lists.ntp.org>
Date: 2025-07-07 09:38 +0000
Subject: Re: [EXT] Re: Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18
Message-ID: <aGuUd-UE2_quWNi4@localhost>
In reply to: #164192

On Fri, Jul 04, 2025 at 06:54:13AM +0000, Windl, Ulrich wrote:
> Well,
> 
> We could start a discussion what "UNSYNC" really means:
> Does it mean the clock is free-running (not updated by the clock discipline), or does it mean the clock's estimated offset is "just terrible" (like 16 seconds)?

I think in the context of the clock_select() function it means there
is no source selected and the clock cannot be updated. The selection
itself doesn't change the status of the clock. If it was previously
considered to be synchronized, it will still be synchronized.

> With the former definitions it's likely that an issue is discovered earlier by monitoring IMHO.

Monitoring can check reachability directly and discover the issue
even sooner; there is no need to wait for the orphan timeout to
activate after the source becomes unreachable.
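
One way a monitoring script might do that is to look at the reach column of the `ntpq -pn` peers billboard, which is an octal 8-bit shift register with the most recent poll in the lowest bit. The following is only a sketch: the sample lines are made up, the helper names are hypothetical, and it assumes the usual column order (remote, refid, st, t, when, poll, reach, delay, offset, jitter) with the tally character in the first column.

```python
def reach_register(ntpq_line: str) -> int:
    """Return the reach register from one unstripped peers-billboard line."""
    fields = ntpq_line[1:].split()   # drop the leading tally character
    return int(fields[6], 8)         # reach is printed in octal


def missed_last_polls(reach: int, n: int) -> bool:
    """True if none of the n most recent polls got an answer."""
    mask = (1 << n) - 1
    return (reach & mask) == 0


# Made-up sample lines for illustration.
healthy = "*192.0.2.1       .GPS.            1 u   33   64  377    0.321   -0.012   0.004"
failing = " 192.0.2.2       .GPS.            1 u  260   64  370    0.298   -0.020   0.011"

print(missed_last_polls(reach_register(healthy), 3))   # False
print(missed_last_polls(reach_register(failing), 3))   # True
```

A check like this can raise an alert after a few missed polls, independently of what stratum the server advertises.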

-- 
Miroslav Lichvar



#164194 — Re: [EXT] Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18

From: "Miroslav Lichvar via questions Mailing List" <questions@lists.ntp.org>
Date: 2025-07-02 10:43 +0000
Subject: Re: [EXT] Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18
Message-ID: <aGT-5oBONitTpxU3@localhost>
In reply to: #164191

On Wed, Jul 02, 2025 at 09:23:12AM +0000, Windl, Ulrich wrote:
> Actually, I had completely forgotten about that issue. Reading it again, it seems stratum should be 16 if all sources are unreachable (lost).

Why should it do that?

The idea in NTPv4 is that the decision whether a source is acceptable
should be made on the client side. If a server loses all time sources,
its root dispersion will grow (at 15 ppm by default). If a client of
that server has other sources, it can reselect when the distance
becomes larger than that of the other sources.

If the server quickly switches to the unsynchronized state (as recent
ntpd versions seem to be doing), the client can no longer synchronize
to it, even if it has no other sources available. If there are
multiple clients of that server, their clocks will not stay in sync;
each will drift on its own.
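
The client-side reselection described above can be sketched with a toy model. It is a deliberate simplification, not ntpd's selection algorithm: the root distance is taken as half the root delay plus the root dispersion, the dispersion of the server that lost its sources grows at 15 ppm, and the client simply prefers the smaller distance. The server names and all numeric values are hypothetical.

```python
PHI = 15e-6  # s/s, default dispersion growth rate


def root_distance(root_delay: float, dispersion_at_loss: float,
                  seconds_since_loss: float) -> float:
    """Simplified root distance of a server that lost its sources."""
    return root_delay / 2.0 + dispersion_at_loss + PHI * seconds_since_loss


def choice_at(seconds_since_loss: float) -> str:
    """Which server a client would prefer in this toy model."""
    distances = {
        # good server that lost its reference at t = 0 (hypothetical values)
        "orphaned": root_distance(0.002, 0.001, seconds_since_loss),
        # poorer but still synchronized alternative (hypothetical value)
        "backup": 0.040,
    }
    return min(distances, key=distances.get)


for t in (0, 1800, 3600, 7200):
    print(t, choice_at(t))
# 0 orphaned
# 1800 orphaned
# 3600 backup
# 7200 backup
```

In this model the client moves to the backup on its own after about 42 minutes, and a client without a backup keeps following the orphaned server rather than free-running.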

-- 
Miroslav Lichvar



#164196 — RE: [EXT] Re: Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18

From: "Windl, Ulrich" <u.windl@ukr.de>
Date: 2025-07-04 17:08 +0000
Subject: RE: [EXT] Re: Re: Re: Delay in Switching to Stratum 16 After Local Reference Loss on ntpd 4.2.8p18
Message-ID: <637bfd260e184bb0844d2d746b6646c3@ukr.de>
In reply to: #164194

Well,

We could start a discussion what "UNSYNC" really means:
Does it mean the clock is free-running (not updated by the clock discipline), or does it mean the clock's estimated offset is "just terrible" (like 16 seconds)?
With the former definition, it's likely that an issue is discovered earlier by monitoring, IMHO.
I think an UNSYNC clock could still provide an estimated and a maximum error.

Kind regards,
Ulrich Windl

> -----Original Message-----
> From: Miroslav Lichvar <mlichvar@redhat.com>
> Sent: Wednesday, July 2, 2025 4:55 PM
> To: Windl, Ulrich <u.windl@ukr.de>
> Cc: Dave Hart <davehart@gmail.com>; questions@lists.ntp.org; Jürgen
> Perlinger <juergen.perlinger@t-online.de>; Jürgen Perlinger
> <perlinger@ntp.org>; Windl, Ulrich <windl@ntp.org>
> Subject: [EXT] Re: Re: Re: Delay in Switching to Stratum 16 After Local
> Reference Loss on ntpd 4.2.8p18
> 
> On Wed, Jul 02, 2025 at 09:50:37AM +0000, Windl, Ulrich wrote:
> > Miroslav,
> >
> > from the RFC citations found in the bug report it seems to be specified
> > differently, or I misunderstood.
> 
> If you are referring to the Figure 24 of RFC 5905, which has "return
> (UNSYNC)" for this path, there doesn't seem to be anything suggesting
> that should be handled by resetting the clock to an unsynchronized
> state.
> 
> The clock_select() function in A.5.5.1 doesn't have a return value and
> in this case when no usable sources are present it doesn't do
> anything, it just returns.
> 
>         /*
>          * There must be at least NSANE survivors to satisfy the
>          * correctness assertions.  Ordinarily, the Byzantine criteria
>          * require four survivors, but for the demonstration here, one
>          * is acceptable.
>          */
>         if (s.n < NSANE)
>                 return;
> 
> --
> Miroslav Lichvar


