sed behaving strangely when -n and the delete command are combined

Rob Landley rob at landley.net
Sat Jul 11 01:41:09 UTC 2009


On Friday 10 July 2009 04:54:39 Denys Vlasenko wrote:
> > I note the comment could just be one line:
> >
> > /* "1d;1,ENDp" should start matching at line 2. */
>
> Are you serious? I mean, do you really complain that I do not write
> comments in exactly the way you like?

I suggested that the comment could be shortened and break the flow less, but 
mostly my concern was changing the parenthetical indentation level to be a lot 
more complicated with no indentation hints to help follow it.

You are, of course, free to ignore anything I suggest.  It's your project.

> >> || (sed_cmd->end_line && sed_cmd->beg_line < linenum &&
> >> || sed_cmd->end_line
> >> >
> >> >= linenum)
> >
> > This is more complicated than it needs to be.
> >
> > If you're going to slavishly copy what gnu does, you might want to run a
> > few more tests to see what that actually is.  Specifically:
> >
> >  echo -e "one\ntwo\nthree\nfour" | sed -ne '2d;2,1p'
> >  three
> >
> > So the logic can be simplified: when your start is a number it triggers
> > on the next line >= to that number.   Just change the match start to >=,
> > and then disable numerical starting matches are disabled after the first
> > hit.
>
> I did not think about this example. Let me try... aha, now works:
>
> # echo -ne "first\nsecond\nthird\nfourth\n" | sed -n '1d;2,1p'
> second
> # echo -ne "first\nsecond\nthird\nfourth\n" | ./busybox sed -n '1d;2,1p'
> second
>
> But this breaks another case:
>
> # echo -ne "first\nsecond\nthird\nfourth\n" | sed -n '1d;1,3p'
> second
> third
> # echo -ne "first\nsecond\nthird\nfourth\n" | ./busybox sed -n '1d;1,3p'
> (nothing)
>
> See? You waste your time writing up agressive emails
> but you make mistakes too.

I'm not attempting to be aggressive.  This is your project, you are 
maintainer.  I have no authority to challenge that.  (Or interest.  Or 
anything like the spare time and energy you're putting in if I did.)

I'd just like it to _work_, because I tend to break things.  And I like them 
to be simple, because _when_ I break things I have to wade through the code to 
figure out how to fix it, and wading through the sed code was never fun.

In this case, I've been suggesting that your entire patch could be replaced 
with something like the following small change (against 1.14.1) to address 
this bug:

--- busybox-1.14.1/editors/sed.c	2009-05-13 09:32:07.000000000 -0500
+++ busybox.new/editors/sed.c	2009-07-10 19:05:54.000000000 -0500
@@ -893,7 +893,7 @@
 			|| (!sed_cmd->beg_line && !sed_cmd->end_line
 				&& !sed_cmd->beg_match && !sed_cmd->end_match)
 			/* Or did we match the start of a numerical range? */
-			|| (sed_cmd->beg_line > 0 && (sed_cmd->beg_line == linenum))
+			|| (sed_cmd->beg_line > 0 && (sed_cmd->beg_line <= linenum))
 			/* Or does this line match our begin address regex? */
 			|| (beg_match(sed_cmd, pattern_space))
 			/* Or did we match last line of input? */
@@ -906,6 +906,7 @@
 		/* Is this line the end of the current match? */
 
 		if (matched) {
+			if (sed_cmd->beg_line > 0) sed_cmd->beg_line = -1;
 			sed_cmd->in_match = !(
 				/* has the ending line come, or is this a single address command? */
 				(sed_cmd->end_line ?

That patch (to 1.14.1) handles both of the cases you listed just fine.  Why 
you're posting enormous patches to address something this simple is what I'm 
trying to _understand_.

I'd run the test suite against it to make sure it didn't cause any other 
regressions, but for some reason "make test" appears to be a NOP in an 
allnoconfig+sed 1.14.1:

>landley at driftwood:~/busybox/busybox.new$ make test
>test -d /home/landley/busybox/busybox.new/testsuite || cp -pPR
> /home/landley/busybox/busybox.new/testsuite
> /home/landley/busybox/busybox.new bindir=/home/landley/busybox/busybox.new
> srcdir=/home/landley/busybox/busybox.new/testsuite \ /bin/sh -c "cd
> /home/landley/busybox/busybox.new/testsuite &&
> /home/landley/busybox/busybox.new/testsuite/runtest "
> landley at driftwood:~/busybox/busybox.new$

I'd check that I was calling it right, but it no longer appears to be listed 
in "make help".  (At least "make test" is a recognized target, "make tests" 
and "make testsuite" aren't.)

Going into the testsuite diectory and running "./runtest sed" by hand says 
that sed wasn't built.  (Even though I can run it by hand out of the busybox 
binary I built, one directory up.)

On a related note, the busybox multiplexer also appears to be broken, at least 
for allnoconfig+sed.  When I just tried it (make distclean, make allnoconfig, 
make menuconfig, switch on editors->sed, make) it went:

$echo -e "first\nsecond\nthird\nfourth" | ./busybox sed -n '1d;1,3p'
sed: unmatched 'e'
bash: echo: write error: Broken pipe

I had to create a "sed" symlink and use that to test this functionality.  
(Switching long options back on didn't fix it.  I encountered that in 1.14.1 
but just confirmed that this behavior is the same for me in current -git.  
(Perhaps that's what's confusing the test suite?)

By the way, the first time I tried to build my above patch to test it (again 
working against the 1.14.1 release version), when I did a "make -j" (again on 
the allnoconfig plus editors->sed version), it did this about halfway through:

libbb/appletlib.c:53:27: error: applet_tables.h: No such file or directory
libbb/appletlib.c:149:5: warning: "NUM_APPLETS" is not defined
libbb/appletlib.c:159:5: warning: "NUM_APPLETS" is not defined
libbb/appletlib.c: In function ‘find_applet_by_name’:
libbb/appletlib.c:170: error: ‘NUM_APPLETS’ undeclared (first use in this 
function)
libbb/appletlib.c:170: error: (Each undeclared identifier is reported only once
libbb/appletlib.c:170: error: for each function it appears in.)
libbb/appletlib.c: In function ‘busybox_main’:
libbb/appletlib.c:644: error: ‘MAX_APPLET_NAME_LEN’ undeclared (first use in 
this function)
make[1]: *** [libbb/appletlib.o] Error 1
make[1]: *** Waiting for unfinished jobs....

That might have been fixed in the current version, I haven't really tried to 
reproduce it since.  It seems to be an asynchronous weirdness with make -j, 
and I stopped feeding -j to the busybox build.

(When I say "I break things", I mean that trying to test the fix for one bug 
and encountering three other unrelated bugs along the way is a normal 
afternoon for me.  It always has been.  I don't know why.  It's kind of 
annoying, really.)

> > Beyond
> > that, you want to fall through like it was already doing to the normal
> > match ending logic, which has had <= linenum since my days.
> >
> > So you can get closer to the gnu behavior by changing the code _less_.
> >  Which
>
> Nice theory, until you actually try to implement it instead of just
> describing how easy it is and how idiotic someone else's code is.

See above.  I'm sorry, I honestly thought I was describing it clearly.  My 
bad.

> > means you don't have to avoid snapshotting the line and continuing on to
> > the "should we end the match now" logic.
> >
> > Also, why do you discard the comment about snapshotting the line?  The
>
> -               /* Snapshot the value */
>                 matched = sed_cmd->in_match;
>
> Because comment says that we are saving a value. Well.
> That's kind of obvious. The comment should say why we do that
> in order to be useful.

Ok, agreed.

It's a sequencing issue, we have to record whether or not we're in a match 
before ending that match.  And we have to decide whether or not we're ending 
the match before acting on the mamtch.  (For one thing, we have to test an 
ending regex before we permute the line in such a way as to create false 
positives or false negatives on the regex match.  I vaguely recall that the 
horrible fiddly old_match stuff was involved in the sequencing too, but it was 
something like 5 years ago now and I'd have to spend 15 minutes staring at it 
to say anything useful...)

The code also (at least used to) continue/break from more than one place, so 
we wanted to set the value that would be remembered early rather than trying 
to make sure there was no loop ending path that would leave it un-updated.  So 
you needed to know the match value _after_ seeing if this line started a 
match, _before_ seeing if this line ended a match, and deferring updating the 
match value both produced a couple bytes bigger code and meant you had to 
check multiple codepaths for correctness.

Keep in mind that sed was the first applet I seriously redid for busybox, and I 
hadn't gotten quite so anal about reproducibility and regression testing back 
then.  I had lots and lots of tests comparing the gnu and busybox sed 
behaviors, but once I got 'em to pass I threw them away.  I've regretted them 
ever since, and doing a proper sed testsuite was a todo item of mine for 
several years, but I'd estimate I threw away over fifty tests.

For example, just getting the "last line of the file does not end with a 
newline" behavior to match gnu involved a dozen or so different tests.  What 
happens if that line without a newline has some of its contents replaced, does 
it get a newline appended to it?  Does the lack of newline attach to the hold 
buffer as an attribute?  What happens if you have multiple input files and the 
line without a newline is at the end of one of the earlier files?  (In the 
multiple input file case, if the line without a newline is the last thing 
output because nothing matched out of the later files, there's still no 
newline.  But is soon as you output another line for any reason, the newline 
is retroactively added.  Oh, and there was something funky to do with blank 
lines in here too, but it was any years ago.)

If I used a dozen tests during development, could I trim that down to maybe 4 
or 5 important ones that cover all the cases?  What about the behaviors I 
tested that _might_ be strange but turned out to behave obviously, keep those 
tests?  (And this was before I even found out there was a sed spec online...)

> > Ok, buried in the random wordwrapping changes, you're now calling regfree
> > in the middle of a sed, which seems way overkill.
>
> sed_cmd->in_match = !(complex multi line expression)

*shrug*  That was the way that produced the smallest code, back in 2003 when I 
was brand new to busybox development.  (I believe it was originally an if/else 
staircase of some kind, which produced significantly larger code.)

> was already somewhat hard to understand. When I needed to throw in
> even more into it, it got really bad. So I split it into
> separate statements:

It was the "does this need to become significantly bigger" part I was hung up 
on.

> > If we're in a match, and beg_line > 0, set beg_line = -1.  That's all you
> > need to do to prevent the match from triggering again after we end it.
> >  You can even do that test every time, it's small and simple and really
> > cheap, happens totally in L1 cache, and more or less parallelizes away on
> > a modern processor with multiple execution cores.
>
> This makes sense. Except for now it can erroneously match here:
>
>                         /* Or did we match last line of input? */
>
>                         || (sed_cmd->beg_line == -1 && next_line == NULL);
>
> I guess I can try to use -2

Oh right, that's what -1 means.  I'd forgotten.  (Last email I originally said 
INT_MAX because I didn't remember if any of the negative values meant 
something, but that was horrible and I changed it before hitting send, and 
then forgot I hadn't checked that -1 wasn't used for something else already.  
I very vaguely recall that the beg_line > 0 tests were all there to make sure 
we don't barf on a file with > 2 billion lines, since the line we're matching 
can never be 0 unless it wraps all the way around.  Whether that's interesting 
is an open question, I always assumed somebody else would remove it in a code 
review pass at some point.  I don't remember if I put that in or if it was in 
Glenn's code before I ever ripped sed a new one.  Now, of course, they allow 
<= to be a free change that doesnt' increase the size...)

This is why I was relucatant to submit an actual patch, because A) It's been 
something like five years since I seriously looked at this code so all my 
memories of it are vague impressions that "this bit was important" or "there 
used to be some reason for that", B) it's changed a lot since 2003 when it was 
the first applet I seriously rewrote for busybox, so I dunno what's still 
relevant C) it was always darn fiddly making a small sed, since the problem 
space sed attacks isn't small in the first place and adding insufficiently 
documented gnu extensions on top of that doesn't improve matters.  (Yes, I 
actually read the info pages for gnu sed when originally attacking this 
monster.  And the sort info pages too.)

Unfortunately, sitting down and properly refamiliarizing myself with the sed 
code woult take at least 4 interrupted hours, and it's much, much, much easier 
to find 15 minute increments to work on stuff.  (Also I have to head up north on 
monday for a couple weeks, and I'm trying to finish three other things before 
then.)

(P.S. I miss the days when Erik and Glenn and Manuel were still around, and 
understood all this code better than I did, so half the time I could just make 
puppy eyes at them and they'd fix it.  Sorry if I'm treating you that way.  I 
also miss the days when I could retire to a corner, focus on one single thing 
for three weeks and really get to know what I was _doing_ there.  It's hard to 
do that on five projects at once.)

> >>        /* Okay, so did this line match? */
> >>-       if (sed_cmd->invert ? !matched : matched) {
> >>+       if (sed_cmd->invert ? matched : !matched)
> >>+           continue; /* no */
> >>+
> >
> > Um, not noise.  This is, in fact, a functional change totally unrelated
> > to the rest of the patch of the kind that tends to cause subtle bugs,
> > which I used to be very sensitive to in sed because it's so complicated
> > and fiddly.
> >
> > You left this change in, but didn't change the indentation of everything
> > that used to be a block, so the indentation is now misleading.
>
> I did. You complained that the patch is messed up. So I rediffed it with
> diff -bB to make it look better. Now you complain about *that*.

It's fundamentally a 2 line fix.  (Whether it needs to be set it to -1 or -2 to 
disable it doesn't change that.)  Finding it difficult to revert pages of 
unrelated changes from a 2 line fix suggests that maybe coming up with a fresh 
2 line fix from scratch, not including all the unrelated changes, would be the 
easy way.  (That's the mindset I was approaching this from, anyway.)

At the time, I didn't know what else your code was doing.  Thus with all the 
time that's passed and changes the code has undergone since, I assumed I'd 
probably missed some subtlety (as it turns out I did) and thus it might 
instead be a 5 line, or even 10 line fix.  (Still not big enough you couldn't 
easily cut _just_that_ out, so I could see it.)

Your change actually was only a dozen or so relevant lines, buried in several 
times as many lines of completely unrelated changes.  I've now read through 
one of those patches to see what it was doing, reverting more of the unrelated 
costmetic changes probably doesn't help at this point.

Rob

P.S.  I'm trying to do code review in areas that I at least used to have 
something to contribute in.  I can stop if it bothers you.
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds


More information about the busybox mailing list