Ash + telnetd: telnet client hangs after exit

Mon Oct 15 16:03:41 UTC 2007

Denys Vlasenko wrote:
> On Monday 15 October 2007 15:15, Alexander Kriegisch wrote:
>   
>> BTW, the "ps" output diff looks like this on i386:
>>
>> $ diff -U0 ps1.txt ps2.txt
>> --- ps1.txt     2007-10-15 16:13:06.000000000 +0200
>> +++ ps2.txt     2007-10-15 16:13:25.000000000 +0200
>> @@ -130 +129,0 @@
>> -root     31704 31643  0 16:10 ?        00:00:00 [login] <defunct>
>>     
>
> Aha. You have a login which does not exec shell, it spawns it
> as a child. When shell exits, login exits too.
>
> What looks strange to me is that you see a _zombie_ login.
> It means that it exited, but is not waited for yet.
>
> How come telnetd doesn't see EOF from login's fd? It *exited*,
> and that implicitly closes all fds! I'm puzzled.
>   
telnetd doesn't see EOF on its side of the pty because the client pty is
still held open by a child of the shell, not by the shell itself.

The reason why the process is a zombie is that telnetd will only wait 
for the child after sending SIGKILL, and it will only send SIGKILL after 
either the network connection or the client side of the pty is closed. 
But in this case the problem is that the client pty is not closed.
> (1) can you check PPID of zombie login? Is it 1 or <telnetd's PID>?
>   
Therefore the PPID should be telnetd's PID. Also, if the PPID were init,
that would mean that init is not reaping its children, which is unlikely.
> (2) is it possible that you start telnetd so that it inherits
>     "ignore SIGCHLD" from the parent? Try adding this line
>     after signal(SIGPIPE, SIG_IGN):
>
>      signal(SIGPIPE, SIG_IGN);
> +    signal(SIGCHLD, SIG_DFL);
>   
SIGCHLD should already be SIG_DFL. Maybe you meant SIG_IGN, but ignoring
SIGCHLD should change nothing except for the leftover zombie process.
> Unrelated note: I looked at telnetd source and tightened up
> some loose ends. Can you test this patch? (I don't think
> it will help with this particular problem, though...)
>   
The only change in your patch which seems relevant to this problem is 
where you removed the SIGKILL before the wait(). I think this is 
dangerous in a single-threaded server. The child shell will probably 
eventually clean up and exit, but if that takes some seconds, all other 
connections will hang for that time.

As I wrote earlier, the problem is that the shell exits but the client 
pty remains open. You can reproduce this as follows:

telnet server
# login ...
# on the server:
sleep 10 & exit

The telnet client will only exit after sleep terminates.
SIGCHLD is currently left at SIG_DFL, which means it is effectively
ignored, and the session will only be terminated when the client pty or
the network connection is closed.
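
The descriptor-inheritance effect behind this can be sketched in a few
lines. This is only an analogue, not telnetd code: it uses a plain pipe
instead of a pty to stay short (a pty master reports the hangup somewhat
differently), and the function name eof_seen_after_shell_exit() and its
parameters are made up for the illustration. A "shell" child forks a
background job that inherits the write end, then exits at once; the
parent reaps the shell and checks whether it already sees EOF.

```c
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Pipe analogue of "sleep N & exit" (hypothetical helper, not telnetd code).
 * The "shell" forks a background job that keeps the inherited write end
 * open for hold_ms, then the shell itself exits immediately.
 * Returns 1 if the parent sees EOF/hangup within wait_ms after reaping
 * the shell, 0 if it timed out, -1 on error.
 */
static int eof_seen_after_shell_exit(int hold_ms, int wait_ms)
{
    int fds[2];
    pid_t shell;
    struct pollfd pfd;
    int n;

    if (pipe(fds) < 0)
        return -1;

    shell = fork();
    if (shell < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (shell == 0) {
        close(fds[0]);
        if (fork() == 0) {              /* background job, like "sleep 10 &" */
            poll(NULL, 0, hold_ms);     /* sleep while still holding fds[1] */
            _exit(0);
        }
        _exit(0);                       /* the "shell" exits right away */
    }

    close(fds[1]);                      /* parent keeps only the read end */
    waitpid(shell, NULL, 0);            /* the shell is gone and reaped ... */

    pfd.fd = fds[0];
    pfd.events = POLLIN;
    n = poll(&pfd, 1, wait_ms);         /* ... but is the write end closed? */
    close(fds[0]);
    if (n < 0)
        return -1;
    /* Nobody ever writes, so any event here means EOF/hangup. */
    return n > 0 ? 1 : 0;
}
```

With the background job holding the descriptor for 1000 ms,
eof_seen_after_shell_exit(1000, 100) returns 0: the shell has exited and
been waited for, yet no EOF arrives. With a 100 ms job,
eof_seen_after_shell_exit(100, 3000) returns 1 as soon as the job exits.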

Now the question is whether this should be considered a bug or a feature.

If it is considered a bug, the solution is to remove the SIGKILL and the
wait, install a signal handler for SIGCHLD, and inside the signal handler
find the tsession for the pid and close the session.

Regards
Ralf Friedl