series of ctrl-c makes ssh session hang

Fri Feb 3 11:58:51 UTC 2017

On Thu, Feb 2, 2017 at 3:25 PM, Ronny Meeus <ronny.meeus at gmail.com> wrote:
>>> When pressing enter in the ssh session I see for dropbear:
>>> # strace -p 2066
>>> strace: Process 2066 attached
>>> _newselect(8, [3 5 7], [], NULL, {3516, 826061}) = 1 (in [5], left {3512, 787390})
>>> clock_gettime(0x6 /* CLOCK_??? */, {327, 955324187}) = 0
>>> read(5, ";\235\21\332\365\210T\200X}\230\"\306.\363\221", 16) = 16
>>> read(5, "\2\345\252\274\24Y\253\21\316>}\266\fU\20259\324\254Tu\3534\0238bMXzV\274\270", 32) = 32
>>> clock_gettime(0x6 /* CLOCK_??? */, {327, 955324187}) = 0
>>> writev(7, [{iov_base="\r", iov_len=1}], 1) = 1
>>> clock_gettime(0x6 /* CLOCK_??? */, {327, 956324197}) = 0
>>> _newselect(8, [3 5 7], [], NULL, {3600, 0}) = 1 (in [7], left {3599, 999987})
>>> clock_gettime(0x6 /* CLOCK_??? */, {327, 956324197}) = 0
>>> read(7, "\r\n", 16375)                  = 2
>>> clock_gettime(0x6 /* CLOCK_??? */, {327, 956324197}) = 0
>>> writev(5, [{iov_base="\231\310\271\315\354\243\342\271\22,\325Tj\n\356\345\"t\332d\205\317.\213\376\200\274h\201\347$\324"..., iov_len=48}], 1) = 48
>>> clock_gettime(0x6 /* CLOCK_??? */, {327, 957324207}) = 0
>>> _newselect(8, [3 5 7], [], NULL, {3600, 0}^Cstrace: Process 2066 detached
>>>
>>> While the sh process is not printing any additional traces. So this
>>> process is completely blocked:
>>> /isam/slot_default/run # strace -p 2078
>>> strace: Process 2078 attached
>>> futex(0xffed598, FUTEX_WAIT_PRIVATE, 2, NULL
>>>
>>>
>>> Connecting a debugger to the system (sh pid 2078) shows that the only
>>> thread the process has is blocked
>>> on a mutex in the C library.
>>>
>>> (gdb) info threads
>>>   Id   Target Id         Frame
>>> * 1    Thread 2078       0x1003d0ec in putprompt (s=<optimized out>) at shell/ash.c:2455
>>> (gdb) bt
>>> #0  0x0ff5c708 in __lll_lock_wait_private (futex=0xffed598 <main_arena>) at ../nptl/sysdeps/unix/sysv/linux/lowlevellock.c:31
>>> #1  0x0fef07a8 in *__GI___libc_free (mem=<optimized out>) at malloc.c:3714
>>> #2  0x1003d0ec in putprompt (s=<optimized out>) at shell/ash.c:2455
>>> #3  setprompt_if (do_set=<optimized out>, whichprompt=<optimized out>) at shell/ash.c:2501
>>> #4  0x1003d448 in parsecmd (interact=<optimized out>) at shell/ash.c:12074
>>> #5  0x1004100c in cmdloop (top=<optimized out>) at shell/ash.c:12215
>>> #6  0x10042730 in ash_main (argc=<optimized out>, argv=<optimized out>) at shell/ash.c:13350
>>
>> Looks like a signal interrupted malloc or free, then the
>> signal handler longjmp'ed (ash does that by design)
>> without returning to the malloc or free.
>> The malloc state is now corrupted (presumably the main_arena
>> lock is still held), and free() in putprompt() deadlocks.
>>
>> An INT_OFF/INT_ON pair, guarding code which must not be
>> interrupted like this, is missing somewhere.
>
> Interesting info, thanks.
>
> How do we continue to identify the place in the code?

I guess by code review and experiments. For example,
try adding "INT_OFF;" and "INT_ON;" around this
code block:

# if ENABLE_FEATURE_TAB_COMPLETION
                line_input_state->path_lookup = pathval();
# endif
                reinit_unicode_for_ash();
                nr = read_line_input(line_input_state, cmdedit_prompt,
                                buf, IBUFSIZ, timeout);
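
To show what such a guard pair buys you, here is a minimal standalone
sketch. It is not the ash implementation: int_off(), int_on(),
suppress_int, pending_int and run_demo() are hypothetical names made up
for this example. While the counter is non-zero the handler only records
the interrupt, and the longjmp happens when the guarded region ends.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>

/* Hypothetical names, only loosely modeled on the INT_OFF/INT_ON idea. */
static volatile sig_atomic_t suppress_int; /* >0: inside a guarded region */
static volatile sig_atomic_t pending_int;  /* SIGINT arrived while guarded */
static sigjmp_buf main_loop;

static void raise_interrupt(void)
{
    pending_int = 0;
    siglongjmp(main_loop, 1);
}

static void on_sigint(int sig)
{
    (void)sig;
    if (suppress_int > 0) {
        pending_int = 1;        /* defer: do not jump out of malloc/free */
        return;
    }
    raise_interrupt();
}

static void int_off(void)
{
    suppress_int++;
}

static void int_on(void)
{
    assert(suppress_int > 0);
    if (--suppress_int == 0 && pending_int)
        raise_interrupt();      /* the deferred interrupt fires here */
}

/* Returns 1 if the interrupt took effect only after free() completed,
 * -1 if it jumped out early, 0 if no interrupt was seen at all. */
static int run_demo(void)
{
    volatile int completed = 0; /* volatile: read after siglongjmp */
    struct sigaction sa;
    char *p;

    suppress_int = 0;
    pending_int = 0;
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGINT, &sa, NULL);

    if (sigsetjmp(main_loop, 1) != 0)
        return completed ? 1 : -1;

    int_off();
    p = malloc(64);
    raise(SIGINT);              /* simulate ctrl-c in the middle */
    free(p);
    completed = 1;
    int_on();                   /* jumps back to sigsetjmp above */
    return 0;
}
```

Calling run_demo() should return 1 here: the handler ran in the middle
of the guarded region but only set the flag, so malloc/free were never
abandoned halfway.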

> Does this not mean that before all library calls we need to make sure
> signals are disabled?

Not all library calls, only those which keep internal state or take
locks, such as malloc() and free(). For example, read() or strlen()
can be interrupted and longjmp'ed away from with no ill effects.
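
For comparison, literally disabling the signal around one such call
could look like the sketch below. This is a different technique from
the INT_OFF/INT_ON pair, shown only to illustrate the idea:
blocked_alloc_demo() and count_sigint() are hypothetical names, and
sigprocmask() is used here purely for the demonstration. SIGINT is
blocked around malloc()/free(), stays pending, and is delivered once
it is unblocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>

static volatile sig_atomic_t got_sigint;

static void count_sigint(int sig)
{
    (void)sig;
    got_sigint++;
}

/* Stores how many times the handler had run while SIGINT was still
 * blocked (*during) and after it was unblocked again (*after). */
static void blocked_alloc_demo(int *during, int *after)
{
    struct sigaction sa;
    sigset_t block;
    char *p;

    assert(during && after);

    sa.sa_handler = count_sigint;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGINT, &sa, NULL);

    sigemptyset(&block);
    sigaddset(&block, SIGINT);

    got_sigint = 0;
    sigprocmask(SIG_BLOCK, &block, NULL);   /* guarded region starts */
    p = malloc(128);
    raise(SIGINT);                          /* stays pending, handler not run */
    free(p);
    *during = got_sigint;
    sigprocmask(SIG_UNBLOCK, &block, NULL); /* pending SIGINT delivered here */
    *after = got_sigint;
}
```

Calls like read() or strlen() would not need this kind of wrapping.
Doing two syscalls around every allocation is also more expensive than
a counter in memory, which is one reason a shell might prefer the
counter approach.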