You can't spell "evil" without "vi".
rob at landley.net
Wed Oct 15 16:36:19 UTC 2008
On Tuesday 14 October 2008 05:34:46 Roberto A. Foglietta wrote:
> 2008/10/14 Denys Vlasenko <vda.linux at googlemail.com>:
> > On Tuesday 14 October 2008 10:48:54 am Rob Landley wrote:
> >> On Sunday 12 October 2008 23:48:22 Rob Landley wrote:
> >> > Next time it reads a buffer, it starts with the last character of a
> >> > cursor left sequence: capital D. Capital D is "delete to end of
> >> > line", which it does.
> >> >
> >> > So basically, busybox vi is corrupting your data when you cursor
> >> > around in a file on a loaded system. Wheee...
> >> Hmmm... I redid the readit() code to only read ahead when processing an
> >> escape sequence. (This let me shrink the readahead buffer to 8 bytes,
> >> actually 5 but with 32 bit alignment 8 makes more sense. Bloatcheck
> >> says I shrunk the code by 17 bytes.)
> > Disregard my previous patch, I just looked at your code and it's
> > as good but it's smaller than mine, so let's use yours.
> >> Unfortunately, this mitigated the problem a bit, but didn't actually
> >> _fix_ it. It happens less often, but I can still trigger it.
> >> I _think_ this is actually a qemu issue. The escape sequences are being
> >> generated by the host Linux, which are then sent to the qemu process
> >> over a virtual serial console, which breaks them down into individual
> >> bytes with an interrupt for each.
> >> This means that the blocking we're depending on to parse escape
> >> sequences doesn't work over a serial console. You _can get an escape
> >> character by itself with poll saying there's no more data, and then on
> >> the next clock cycle you can get a "[D".
> >> Hmmm...
> >> Ok, making poll wait 300 miliseconds before deciding there's no next
> >> character in a pending escape sequence seems to have fixed it. (At
> >> least I can't reproduce the problem under qemu anymore.)
> > Please document this next time, or someone else might come later
> > and delete the timeout. I did this a few mins ago :( will fix it now.
> > Did you try something smaller than 300ms?
> As far as I understood the problem: considering a 1200bps line, 120
> chars per second, 40 escape sequence per seconds, then the minimum
> timeout should be at least 1/40 sec = 25 ms.
It's not the number of escape sequences per second, it's the delay between
each character in the escape sequence. (The problem is when we don't
recognize that this string of characters should be interpreted as a block,
and instead interpret them as individual characters.) So if this was the
_only_ issue, the minimum delay needed is 1/120th of a second. (But see
> This is for a fixed speed
> line, considering an asynchronous data line with an average speed of
> 1200bps and a bell curve variance of 25 ms then using a timeout of
> 100ms would catch the 0.999936657516% of the escape sequences.
I'm not following that at all. If it's just a question of giving the next
character time to come in at 1200bps, a 25 ms delay should catch 100.00% of
the escape sequences. There's no "variance" in the timing of characters
coming in on a serial line, if they didn't work on a hard clock the data
would be corrupted at the physical link layer.
> Enlarging the timeout to 150ms 0.999999998027%. Over 200ms should not
> make any sense any more if the variance has been correctly estimated
> in 25 ms.
Where does statistics enter into serial communications?
No, the specific case I'm worried about is virtual systems running under an
emulator. (In addition to kvm and xen and such, I bang on qemu a lot to test
busybox on non-x86 platforms. A virtual serial console is a standard way to
talk to that sort of thing.)
The emulator process is a normal userspace process on the host system, subject
to the process scheduler and thus capable of having its execution suspended
at any time for an arbitrary length of time on a loaded host system. When
the host system schedules away the emulator process, the virtual system's
processor pauses. But the poll() timeout is in wall clock time, which
doesn't pause. This means that when the emulator _does_ resume, there's a
race between the timer interrupt (telling the poll to return because there
was no data within the requested time period) and the serial interrupt
(queueing the new data). Which one takes precedence? I have no idea, that's
a "dig into the guts of the Linux kernel" thing.
If the timeout take precedence over the servicing the serial interrupt, then
we need a long enough delay to make sure that the emulator process gets
scheduled to handle the next character before the timeout. If the serial
interrupt takes precedence over the timer, then even a delay of 25
miliseconds should be plenty.
(The other fun little detail is that the _sending_ process on the other end of
the serial line can get scheduled away, but it submits data to the kernel as
a multi-character block so even if your terminal process gets scheduled away,
it either hasn't sent the escape sequence yet or has queued the whole thing
into the kernel's serial output buffer which isn't subject to userspace
scheduling. At least as long as the write blocks until complete instead of
returning a short write. I think.)
I think I can work out a test for this (write a script suspending the qemu
process with "while true; do kill -STOP $PID; sleep 1; kill -CONT $PID; sleep
1; done" and then hold down "cursor left" in vi for a couple minutes and see
if it zaps the line it's one. Those delays are long enough that even the
current 300 ms wait to connect a partial sequence won't be long enough, so it
should either manifest the bug or let us know that we can cut the delay down
to something unnoticeable like 50ms and still be safe.)
(P.S. Most network things send data in larger blocks (256 bytes for ppp, 1500
bytes for 100baseT, 65k for gigabit) so this problem doesn't easily manifest
via ssh, just via serial connection. It could theoretically manifest via ssh
on a poor connection where you queue up 500+ cursor movements, have an escape
sequence cross packet boundaries due to the MTU size, and then have the
second packet delayed by retransmission. But that's kind of unlikely.)
More information about the busybox