[PATCH 0/5] Fix ntpd to not poll frequently

Miroslav Lichvar mlichvar at redhat.com
Thu Sep 25 13:25:34 UTC 2014


On Wed, Sep 24, 2014 at 04:13:14PM +0200, Denys Vlasenko wrote:
> My experience with ISC ntpd (admittedly somewhat dated) is that it
> didn't try to do that hard enough. Somehow it seemed to its authors
> that "we need several minutes to sync the clock" is reasonable.

I think it depends on what exactly you mean by syncing the clock. The
reference NTP implementation is known to be slow in the initial
synchronization and to adapt rather slowly to rapid frequency changes,
and its default stepout interval of 15 minutes is too long.

> It is not. Think about it. If you are setting the mechanical clock
> by looking at another, (presumably) correct clock, how long does it take?
> Few seconds, not minutes.

Setting a clock once by looking at another is quick, but that won't
correct the frequency offset. It would take days or weeks to have a
good estimate of that and then the length of the pendulum or spring
would have to be adjusted, if we want to compare it to what ntpd does.
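
To put a rough number on this, here is a minimal Python sketch of how
a constant frequency offset turns back into a time offset after the
clock has been set once. The 50 ppm figure is an assumed example, not
a measurement of any particular clock.

```python
def accumulated_offset(freq_offset_ppm: float, elapsed_s: float) -> float:
    """Time offset (seconds) built up by a constant frequency offset.

    freq_offset_ppm: frequency error in parts per million (assumed constant).
    elapsed_s: time since the clock was last set, in seconds.
    """
    return freq_offset_ppm * 1e-6 * elapsed_s


# Assumed example: a clock running 50 ppm fast, checked one day after setting.
one_day = 24 * 60 * 60
drift = accumulated_offset(50.0, one_day)
print(f"offset after one day: {drift:.2f} s")
```

Setting the time once removes the offset, but it starts growing again
at the same rate until the frequency itself is corrected.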

> Keeping this in mind, bbox ntpd currently does a few things to speed up
> clock sync. Such as "revert to MINPOLL polling interval if we step the clock".
> The rationale is that if ntpd does discover that step is needed,
> something unusual happened. Such as my laptop hibernating:
> apparently my CMOS clock is busted, it doesn't "tick".

Does your system set the RTC to the system time before suspend? The
busybox ntpd doesn't seem to reset the MAXERROR field in adjtimex, so
the kernel RTC synchronization (aka 11-minute mode) is disabled and
something else is needed to set the RTC. Is that intentional?

> So after hibernating, the clock is off by at least a few seconds,
> sometimes much more. ntpd needs to basically start syncing anew.
> If it would do it with one request per 20 minutes, it won't go
> "reasonably fast", right?

No, I don't see why the polling interval should be reset in this case.
After the clock has been stepped, the time offset is close to zero,
the frequency offset should still be good enough, and the polling can
continue as it did before the system was suspended.

I'd be ok with it if suspending the system were the only reason the
clock could be stepped, but there are others, including:

- another application is messing with the system clock
- remote clock was stepped
- network is congested
- jitter is so large that the measured offset is above the step
  threshold
- frequency offset between local and remote clock is so large that
  the time offset reaches the step threshold

Of these, I think the polling interval should be shortened only in the
last case, and the problem is that it's not easy to reliably
distinguish it from the other cases.
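
The last case in the list can be illustrated with a small Python
sketch: it checks whether the offset built up during one polling
interval reaches the step threshold. The 128 ms threshold, the
frequency offsets and the polling intervals are assumed example
values, and the function names are hypothetical.

```python
def offset_per_poll(freq_offset_ppm: float, poll_interval_s: float) -> float:
    """Offset (seconds) accumulated between two polls at a constant frequency error."""
    return abs(freq_offset_ppm) * 1e-6 * poll_interval_s


def reaches_step_threshold(freq_offset_ppm: float,
                           poll_interval_s: float,
                           step_threshold_s: float = 0.128) -> bool:
    """True if one polling interval is enough to drift past the step threshold."""
    return offset_per_poll(freq_offset_ppm, poll_interval_s) >= step_threshold_s


# Assumed examples: (frequency offset in ppm, polling interval in seconds).
for ppm, poll in [(100, 1024), (500, 1024), (500, 64)]:
    print(ppm, poll, reaches_step_threshold(ppm, poll))
```

Only in this case does a shorter polling interval keep the offset
below the threshold. In the other cases the step has a different
cause, and polling more often does not address it.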

> """
> Don't reset the polling interval to the minimum when all peers are
> unreachable or the clock was stepped to avoid frequent polling.
> """
> 
> If all peers are unreachable, most likely it is a network problem.

If the local network connection is down, sendto() will fail and the
code will keep trying to send the packet at 5-second intervals
(RETRY_INTERVAL), independently of the normal polling interval.
I think this is the most common case.
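
A minimal Python sketch of that behaviour, for illustration only. It
is not the busybox code: the function name is hypothetical, and the
polling interval is assumed to be stored as a power-of-two exponent in
seconds, as is usual in NTP.

```python
RETRY_INTERVAL = 5  # seconds, the value mentioned above


def next_send_delay(send_succeeded: bool, poll_exponent: int) -> int:
    """Seconds to wait before the next transmit attempt.

    A failed send (e.g. local network down) is retried after the short
    fixed RETRY_INTERVAL, regardless of the current polling interval.
    A successful send is followed by the normal interval of
    2**poll_exponent seconds.
    """
    if not send_succeeded:
        return RETRY_INTERVAL
    return 1 << poll_exponent


print(next_send_delay(False, 10))  # send failed: short retry
print(next_send_delay(True, 10))   # send worked: normal polling interval
```

With this in place, a local outage is noticed within seconds of the
link coming back, even if the polling interval itself stays long.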

If it's a problem somewhere else, I'm not sure what assumptions could
be made. The network could be congested somewhere close and polling
frequently could be making it worse. If ntpd is configured to use only
one server, perhaps the service was stopped or the access was
restricted for some reason. How does it help to reset the polling
interval to the minimum here?

> Who knows how long it lasted? What if it lasted many hours?
> I do want to synchronize my clock soon after network problem is fixed,
> not 20 minutes after that.

If the clock was synchronized before the sources became unreachable
and they were not reachable for many hours, does it matter much if the
first clock update after they are reachable again is delayed by one
long polling interval?

> """
> Keep increasing the polling interval in the following situations:
> - no replies are received from a peer
> - no source can be selected
> - peer claims to be unsynchronized (e.g. we are polling it too
>   frequently)
> - recv() returns with an error (e.g. the host doesn't exist or is not
>   running an NTP service)
> """
> 
> I am not sure any of these conditions warrant increasing poll interval.
> 
> Can you explain why you think it should be done?

To make sure the maximum polling interval is always reached as it
would normally. Imagine millions of clients not updating their clocks
for one of the reasons listed above and getting stuck at a short
polling interval (e.g. after they are restarted), increasing the
traffic unnecessarily by orders of magnitude.
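
The difference in traffic can be sketched in a few lines of Python.
The exponents below (32 s minimum, 1024 s maximum) are assumed example
values rather than the actual busybox constants, and the function
names are hypothetical.

```python
MIN_POLL_EXP = 5   # assumed: 2**5  = 32 s
MAX_POLL_EXP = 10  # assumed: 2**10 = 1024 s


def backoff_schedule(n_polls: int) -> list[int]:
    """Intervals (seconds) used for n_polls consecutive polls when the
    interval keeps doubling up to the maximum, even without useful replies."""
    exp = MIN_POLL_EXP
    intervals = []
    for _ in range(n_polls):
        intervals.append(1 << exp)
        exp = min(exp + 1, MAX_POLL_EXP)
    return intervals


def polls_per_day(poll_exp: int) -> int:
    """Number of whole polling intervals that fit into one day."""
    return 86400 // (1 << poll_exp)


print(backoff_schedule(7))
print(polls_per_day(MIN_POLL_EXP), polls_per_day(MAX_POLL_EXP))
```

With these assumed values, a client stuck at the minimum interval
sends 2700 requests per day instead of 84. Multiplied by a large
number of clients, that is the extra load the server has to handle.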

-- 
Miroslav Lichvar

