klogd problem: questions about busybox behavior (1.15.3)

Thu May 27 00:53:24 UTC 2010

On Wednesday 26 May 2010 18:34, Paul Smith wrote:
> Now I do some stuff, then I "killall -q -TERM klogd" and ditto for syslogd:
> 
> > 2
> >  1086 root         0 Z    [klogd]
> >  2078 root      2196 S    syslogd -D -s 200 -b 1 
> >  2080 root      2196 S    klogd 
> >  2082 root      2200 S    grep logd 
> > Name:   klogd
> > Pid:    2080
> > PPid:   1
> > TracerPid:      0
> > Name:   klogd
> > Pid:    1086
> > PPid:   1
> > TracerPid:      0
> > Name:   syslogd
> > Pid:    2078
> > PPid:   1
> > TracerPid:      0
> 
> Note how syslogd was reaped, but not klogd!  Crazy.  Now a bit later in
> that same script, klogd is finally reaped:
> 
> > 4
> >  2078 root      2196 S    syslogd -D -s 200 -b 1 
> >  2080 root      2196 S    klogd 
> >  2128 root      2200 S    grep logd 
> > Name:   klogd
> > Pid:    2080
> > PPid:   1
> > TracerPid:      0
> > Name:   syslogd
> > Pid:    2078
> > PPid:   1
> > TracerPid:      0
> 
> But then I restart them and this time I see both daemons as zombies:
> 
> > 5
> >  2078 root         0 Z    [syslogd]
> >  2080 root         0 Z    [klogd]
> >  2138 root      2196 S    syslogd -D -s 200 -b 1 
> >  2140 root      2196 S    klogd 
> >  2142 root      2200 S    grep logd 
> > Name:   klogd
> > Pid:    2140
> > PPid:   1
> > TracerPid:      0
> > Name:   klogd
> > Pid:    2080
> > PPid:   1
> > TracerPid:      0
> > Name:   syslogd
> > Pid:    2138
> > PPid:   1
> > TracerPid:      0
> > Name:   syslogd
> > Pid:    2078
> > PPid:   1
> > TracerPid:      0
> 
> and that state persists for a while, while I start/stop dropbear and do
> various other things:
> 
> > 7
> >  2078 root         0 Z    [syslogd]
> >  2080 root         0 Z    [klogd]
> >  2138 root      2196 S    syslogd -D -s 200 -b 1 
> >  2140 root      2196 S    klogd 
> >  2183 root      2204 S    grep logd 
> > Name:   klogd
> > Pid:    2140
> > PPid:   1
> > TracerPid:      0
> > Name:   klogd
> > Pid:    2080
> > PPid:   1
> > TracerPid:      0
> > Name:   syslogd
> > Pid:    2138
> > PPid:   1
> > TracerPid:      0
> > Name:   syslogd
> > Pid:    2078
> > PPid:   1
> > TracerPid:      0
> 
> but eventually it all clears up BEFORE I exit my script (so it's
> definitely not just waiting until the script is done):
> 
> > 8
> >  2138 root      2196 S    syslogd -D -s 200 -b 1 
> >  2140 root      2196 S    klogd 
> >  2207 root      2200 S    grep logd 
> > Name:   klogd
> > Pid:    2140
> > PPid:   1
> > TracerPid:      0
> > Name:   syslogd
> > Pid:    2138
> > PPid:   1
> > TracerPid:      0
> 
> Getting a new busybox into my environment takes some finagling but I'll
> give it a try.

Newer busybox runs sysinit actions like this:

        /* Now run everything that needs to be run */
        /* First run the sysinit command */
        run_actions(SYSINIT);
        check_delayed_sigs();
        /* Next run anything that wants to block */
        run_actions(WAIT);
        check_delayed_sigs();
        /* Next run anything to be run only once */
        run_actions(ONCE);

run_actions() starts children and then waits for them
if their type calls for that (SYSINIT does), via waitfor(pid),
which looks like this:

static void waitfor(pid_t pid)
{
        if (pid <= 0)
                return;
        /* Wait for any child (prevent zombies from exiting orphaned processes)
         * but exit the loop only when specified one has exited. */
        while (1) {
                pid_t wpid = wait(NULL);  <===================
                mark_terminated(wpid);
                /* Unsafe. SIGTSTP handler might have wait'ed it already */
                /*if (wpid == pid) break;*/
                /* More reliable: */
                if (kill(pid, 0))
                        break;
        }
}

As you can see, it should reap any child, not just the
specified pid (which is usually sysinit's pid).
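
To see that behavior in isolation, here is a minimal standalone sketch
of the same loop shape. It is not busybox code: spawn_sleeper() and
waitfor_counting() are hypothetical names made up for illustration, and
the ECHILD check is an extra guard that the snippet above does not have.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not in busybox): fork a child that sleeps
 * for `ms` milliseconds and then exits. Returns the child's pid. */
static pid_t spawn_sleeper(long ms)
{
        struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
        pid_t pid = fork();

        assert(pid >= 0); /* sketch only: treat fork failure as fatal */
        if (pid == 0) {
                nanosleep(&ts, NULL);
                _exit(0);
        }
        return pid;
}

/* Same shape as waitfor() above, but returns how many children
 * were reaped before `pid` was gone. */
static int waitfor_counting(pid_t pid)
{
        int reaped = 0;

        if (pid <= 0)
                return 0;
        while (1) {
                pid_t wpid = wait(NULL); /* any child, not only `pid` */

                if (wpid > 0)
                        reaped++;
                else if (errno == ECHILD)
                        break; /* added guard: no children left at all */
                /* kill(pid, 0) fails once `pid` has been reaped */
                if (kill(pid, 0))
                        break;
        }
        return reaped;
}
```

If an unrelated child is started with spawn_sleeper(10) and the target
with spawn_sleeper(300), waitfor_counting() on the target reaps the
short-lived child first and keeps looping until the target itself is
gone, so no zombie is left behind.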

Then, after sysinit et al. are done, we fall into the main loop:

        while (1) {
                int maybe_WNOHANG;

                maybe_WNOHANG = check_delayed_sigs();

                /* (Re)run the respawn/askfirst stuff */
                run_actions(RESPAWN | ASKFIRST);
                maybe_WNOHANG |= check_delayed_sigs();

                /* Don't consume all CPU time - sleep a bit */
                sleep(1);
                maybe_WNOHANG |= check_delayed_sigs();

                /* Wait for any child process(es) to exit.
                 *
                 * If check_delayed_sigs above reported that a signal
                 * was caught, wait will be nonblocking. This ensures
                 * that if SIGHUP has reloaded inittab, respawn and askfirst
                 * actions will not be delayed until next child death.
                 */
                if (maybe_WNOHANG)
                        maybe_WNOHANG = WNOHANG;
                while (1) {
                        pid_t wpid;
                        struct init_action *a;

                        /* If signals happen _in_ the wait, they interrupt it,
                         * bb_signals_recursive_norestart set them up that way
                         */
                        wpid = waitpid(-1, NULL, maybe_WNOHANG); <================
                        if (wpid <= 0)
                                break;

                        a = mark_terminated(wpid);
                        if (a) {
                                message(L_LOG, "process '%s' (pid %d) exited. "
                                                "Scheduling for restart.",
                                                a->command, wpid);
                        }
                        /* See if anyone else is waiting to be reaped */
                        maybe_WNOHANG = WNOHANG;
                }
        } /* while (1) */

This loop is also careful to wait for any pid, and it keeps looping,
so that if two gazillion processes die at once,
we reap all of them right away instead of one per second.
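
The inner loop can be tried on its own with a small sketch like the
one below. Again, this is not busybox code: reap_children(),
spawn_exiting_children() and pause_ms() are hypothetical names, and the
init_action bookkeeping done by mark_terminated() is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: sleep for `ms` milliseconds. */
static void pause_ms(long ms)
{
        struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };

        nanosleep(&ts, NULL);
}

/* Hypothetical helper: fork `n` children that exit immediately. */
static void spawn_exiting_children(int n)
{
        while (n-- > 0) {
                pid_t pid = fork();

                assert(pid >= 0); /* sketch only: fork failure is fatal */
                if (pid == 0)
                        _exit(0);
        }
}

/* Same shape as the inner loop above: the first waitpid() uses
 * `first_flags` (0 to block, WNOHANG to poll), every later one
 * uses WNOHANG. Returns the number of children reaped. */
static int reap_children(int first_flags)
{
        int reaped = 0;
        int flags = first_flags;

        while (1) {
                pid_t wpid = waitpid(-1, NULL, flags);

                if (wpid <= 0)
                        break; /* nothing (more) to reap right now */
                reaped++;
                /* See if anyone else is waiting to be reaped */
                flags = WNOHANG;
        }
        return reaped;
}
```

After spawn_exiting_children(3) and a short pause_ms() to let the
children exit, a single reap_children(WNOHANG) call collects all three
zombies in one pass, and a second call finds nothing left.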

So yes, please do try a newer version.

-- 
vda