RFD: Rework/extending functionality of mdev

Harald Becker ralda at gmx.de
Sat Mar 14 01:25:44 UTC 2015


On 13.03.2015 23:33, Laurent Bercot wrote:
> On 11/03/2015 08:45, Natanael Copa wrote:
>> With that in mind, wouldn't it be better to have the timer code in
>> the handler/parser? When no new messages come from the pipe
>> within a given time, the handler/parser just exits.
>
> I've thought about that a bit, to see if there really was value in
> making the handler exit after a timeout. And it's a lot more complex
> than it appears, because you then get respawner design issues, the
> same that appear when you write a supervisor.

Which issues?

> What if the handler dies too fast and there are still events in the
> queue ?

> Should you respawn the handler instantly ?

Spawning the handler is the job of the named pipe supervisor. First it
checks the exit code of the dying handler and spawns a failure script
if the exit was not successful. Then it waits until data arrives in the
pipe (or is still there = poll for reading), and finally spawns a new
handler.

The trick here is to hold the pipe open for reading and writing in the
supervisor. This way you avoid the race conditions that come from
recreating pipes, and you even catch the situation where an event
arrives at the moment the handler hits its timeout and is dying. Beyond
that, the supervisor does not touch the content transferred through the
pipe.
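
Roughly like this (a minimal sketch of the idea, not my actual code;
the names are illustrative and error handling is trimmed):

  #include <fcntl.h>
  #include <poll.h>
  #include <sys/wait.h>
  #include <unistd.h>

  void supervise(const char *fifo, char *const handler[])
  {
      /* Hold the fifo open O_RDWR: the write side keeps it alive,
       * so writers never see EOF/ENXIO, and an event arriving while
       * a handler dies on timeout just stays queued in the pipe. */
      int fd = open(fifo, O_RDWR);      /* never blocks on a fifo */
      struct pollfd p = { .fd = fd, .events = POLLIN };

      for (;;) {
          poll(&p, 1, -1);              /* wait for (pending) data */

          pid_t pid = fork();           /* spawn a new handler ... */
          if (pid == 0) {
              dup2(fd, 0);              /* ... reading the fifo on stdin */
              execvp(handler[0], handler);
              _exit(127);
          }

          int status;
          waitpid(pid, &status, 0);     /* handler exited (timeout or not) */
          if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
              /* spawn the failure script here */
          }
      }
  }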

> That's exactly the kind of load you're trying to avoid by having a
> (supposedly) long-lived handler. Should you wait for a bit before
> respawning the handler ? How long are you willing to delay your
> events ?

A bit more checking is already planned. Currently I have a failure
counter and detect when the parser repeatedly dies unsuccessfully, but
maybe we can add a respawn counter that triggers a (possibly
increasing) delay after too many respawns without all the pipe data
being processed; when the handler exits and the pipe is empty (poll),
the respawn counter is reset. So you get two or three fast respawns
after the handler dies (on poll timeout) with more data in the pipe;
after that something seems to be wrong, so we start adding increasing
delays before respawning. In the normal case the handler exits due to
timeout with an empty pipe, so we can reset the counter and there is no
need to delay the respawn as soon as new data arrives in the pipe. And
when the respawn counter goes above some limit, or the handler dies
unsuccessfully, a failure script is spawned first, with the arguments:
program name, exit code or signal, failure count.
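
As a sketch of that throttling (hypothetical names; fd is the fifo
descriptor the supervisor holds open):

  #include <poll.h>
  #include <unistd.h>

  static int respawns = 0;

  void after_handler_exit(int fd)
  {
      struct pollfd p = { .fd = fd, .events = POLLIN };

      if (poll(&p, 1, 0) < 1) {
          /* Normal case: pipe is empty, handler exited on timeout.
           * Reset the counter; the next respawn happens without
           * delay as soon as new data arrives. */
          respawns = 0;
          return;
      }

      /* Handler died with data still pending: allow a few fast
       * respawns, then start adding an increasing delay. */
      if (++respawns > 3)
          sleep((unsigned)(respawns - 3));   /* 1s, 2s, 3s, ... */
  }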

> It is necessary to ask these questions, and have the mechanisms in
> place to handle that case - but the case should actually never
> happen: it is an admin error to have the event handler die too fast.

admins don't make errors! ;)

> So it's code that should be there but should never be used; and it's
> already there in the supervisor program that should monitor your
> netlink listener.

OK, you expect the netlink listener to be watched by a supervisor
daemon? Fine, then the fifo supervisor should also be watched, as it
got forked from the same process as the netlink reader ... that means
when we detect handler failures, we can just die and let the outer
supervisor do the job :)

When that happens, the system is usually on its way to hell ... and
even if it happens, what does it mean for the system? ... hotplug
events are no longer handled, we lose them and may have to re-trigger
the plug events as soon as hotplug events are processed again (however
that is achieved) ... and in the worst case you are back at
semi-automatic device management, calling "mdev -s" to update the
device file system.

... but consider the conf file getting vandalized, or the device file
system ... how do you cope with that? ... do you expect to handle those
cases? ... wouldn't it be better to reboot, after counting the failure
in some persistent storage?


> So my conclusion is that it's just not worth it to allow the event
> handler to die. s6-uevent-listener considers that its child should
> be long-lived;

That's the problem with spawning the handler from your netlink reader.
The netlink reader has to open the pipe for writing in non-blocking
mode, then write each complete message as a single chunk and
failure-check the write (you always need to do that and handle it),
done. If opening or writing the pipe is not possible, the device plug
system is gone and needs a restart, so let the netlink listener die
(unusual condition). One critical condition should be watched and
handled: when the pipe is full and the write (poll for write) times
out, what then? ... but that is no different than in your solution.
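
For illustration, the writer side could look like this (a sketch,
assuming the fifo was opened O_WRONLY|O_NONBLOCK; a write of at most
PIPE_BUF bytes to a fifo is atomic, so messages never interleave):

  #include <errno.h>
  #include <limits.h>
  #include <poll.h>
  #include <unistd.h>

  int send_event(int fd, const char *msg, size_t len)
  {
      if (len > PIPE_BUF)
          return -1;                /* couldn't be written atomically */

      for (;;) {
          ssize_t n = write(fd, msg, len);
          if (n == (ssize_t)len)
              return 0;             /* complete message, single chunk */
          if (n < 0 && errno == EAGAIN) {
              /* Pipe full: poll for writability with a timeout.
               * If even that times out, something is badly wrong. */
              struct pollfd p = { .fd = fd, .events = POLLOUT };
              if (poll(&p, 1, 5000) > 0)
                  continue;
          }
          return -1;                /* fifo gone or stuck: let the
                                       netlink listener die */
      }
  }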

--
Harald

