[PATCH 0/2] ntpd: retry name-resolution until success

Thu Feb 11 12:25:23 UTC 2016

On 11/02/2016 12:44, Denys Vlasenko wrote:
> In practice it is impossible to make daemons 100% robust.
> Because of this, we have several generations of babysitting tools
> which restart daemons on exit.

  I don't think that's the right argument. Supervision tools should not
be an excuse to be complacent about robustness.

  However, in this particular case, I agree with the exiting behaviour:
the DNS resolution is only happening at start time, the daemon has not
entered its loop yet, so there's no state to maintain and waiting would
simply defer readiness. I think exiting is the correct thing to do when
an error happens before the daemon is ready.
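
  To illustrate, here is a rough sketch of that start-up behaviour. It
is not the actual ntpd code: the function names and the injectable
resolver are hypothetical, chosen only to show the shape of "fail before
ready, let the supervisor retry".

```python
import socket
import sys

def resolve_peer(host, port=123):
    """Resolve one peer; return the address list, or None on failure."""
    try:
        return socket.getaddrinfo(host, port, type=socket.SOCK_DGRAM)
    except socket.gaierror:
        return None

def startup(peers, resolve=resolve_peer):
    """Resolve every peer before entering the main loop.

    Returns a process exit status: 0 if all peers resolved, 1 otherwise.
    Nothing has been set up yet at this point, so bailing out loses no
    state; whatever started the daemon can simply try again later.
    """
    for host in peers:
        if resolve(host) is None:
            print("cannot resolve %s, exiting" % host, file=sys.stderr)
            return 1
    # The daemon would enter its loop (and become ready) here.
    return 0
```

  A caller would end with something like sys.exit(startup(peer_list)),
where peer_list is whatever the configuration provides.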

>> That's utterly broken. You end up with an invalid or badly-wrong
>> clock for the first 20 seconds after boot, which could lead to all
>> sorts of problems with timestamps.
>
> You can't make your boot dependent on clock being set early.
> What if your network init always takes ~2 minutes?
> This is the case on my home DSL modem: DSL line training
> is that slow.

  I would love to have more input on this. My experience agrees with
Denys here: system clock initialization that depends on the network is
hard, because the clock may be incorrect until the network (and the
time synchronization daemon) is up.

  This is not a real issue for machines that have a battery-powered
hardware clock: the system clock can be set from the value of the
hardware clock early at boot, and should be accurate enough for the
boot process until the network is up.

  When there's no battery-powered hardware clock, however, I haven't
found a good solution for the early boot. The system clock *will* be
wildly inaccurate for some time.
  - It can be made "reasonable" by writing the system clock's value at
shutdown time and reading it back at boot time. However, for devices
with a read-only rootfs, this requires reading a value from a writable
filesystem, which is not necessarily mounted early in the boot process.
So there is still a part of the boot process, up until the correct
filesystem is mounted, that has a wildly inaccurate system clock. This
is not a complete solution.
  - For now, my solution is to set the system clock to an arbitrary,
"close enough" value at the very start of the boot process. This ensures
logs are not filled with nonsense values such as 1970-01-01. However,
it does not guarantee monotonic timestamps from one boot to the next:
the early boot process, up until the mounting of read-write filesystems
and the reading of the old value of the system clock, will still have
incorrect, duplicate timestamps.
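
  The two approaches above could be combined roughly as follows. This is
only a sketch under stated assumptions: BUILD_EPOCH and STATE_FILE are
hypothetical (a floor value chosen when the image is built, and a file on
whichever writable filesystem is available), and restore_clock needs
enough privileges to set the clock.

```python
import os
import time

# Hypothetical values, not taken from any real system.
BUILD_EPOCH = 1455192000             # arbitrary "close enough" floor
STATE_FILE = "/var/lib/clock/last"   # assumed path on a writable filesystem

def read_saved_epoch(path):
    """Return the saved timestamp, or None if missing or unparsable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def choose_boot_epoch(now, floor, saved):
    """Pick the largest known value so the clock never steps backwards."""
    candidates = [now, floor]
    if saved is not None:
        candidates.append(saved)
    return max(candidates)

def save_epoch(path, epoch):
    """Write via a temporary file so a power cut cannot truncate the value."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(epoch))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

def restore_clock(path=STATE_FILE, floor=BUILD_EPOCH):
    """Step the system clock forward if a later value is known."""
    now = int(time.time())
    target = choose_boot_epoch(now, floor, read_saved_epoch(path))
    if target > now:
        time.clock_settime(time.CLOCK_REALTIME, float(target))
    return target
```

  At the very start of boot, only the floor is available, so the call
would behave as if no file existed; once the writable filesystem is
mounted, calling restore_clock again can move the clock forward to the
saved value. It does not remove the window described above.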

  This can be mitigated to an extent by getting the filesystem containing
the latest saved system clock value mounted ASAP, but I have not found a
perfect solution: the catch-all logger, as well as other early daemons
such as udevd or equivalent (on which finding and mounting filesystems
depends), will still be started with a system clock whose values may
repeat those of earlier boots.

  I wonder if there are better solutions.

-- 
  Laurent