[BusyBox] Outright insanity from gnu sed.

Mon Sep 29 00:45:33 UTC 2003

On Thursday 25 September 2003 08:17, Rob Landley wrote:
> Okay, THIS behavior of gnu sed probably isn't worth replicating unless
> somebody complains, unless you can figure out what's going on...?  (I think
> it's a bug.  the appended line is showing up before the line it was
> supposed to be appended to.  That can't be right...)

Okay, the bit about the append line showing up before the thing it's supposed 
to be appended to is right because N flushes the append buffer.  So that's 
actually right.

The _inconsistency_ is that when N hits EOF, it does NOT flush the append 
buffer.  And this explicitly violates the text of the spec:

> The text specified for the a command, and the contents of the file specified
> for the r command, shall be written to standard output just before the next
> attempt to fetch a line of input when executing the N or n commands, or when
> reaching the end of the script. 

It should flush the append buffer before the next _attempt_, not the next 
successful line read.  Gnu sed is wrong here, and I'm not implementing this 
behavior.

Speaking of which, is there any real world test case where gnu sed's "append a 
newline for N on EOF" behavior is actually depended upon?  (I.E. do we really 
_have_ to implement this?  It's amazingly ugly, and those scripts are 
depending on an admitted and documented bug.)

P.S.  I may actually know why their bug happens:

echo -e "boo" | sed -e "aboing" -e "N" -e "s/\n/kablam/"
echo -n -e "boo" | sed -e "aboing" -e "N" -e "s/\n/kablam/"

Their "add extra EOF at end" logic may be triggering when it shouldn't.  Just 
a guess, I haven't looked at their code...

In any case, I hope to post a Big Evil Patch of all the work I've done in the 
past few days to the list within the next 24 hours.  Not guaranteeing it'll 
be fully debugged yet, but I've done 8 gazillion things and need to resync 
with everybody

Random stuff I did:

Moved \n parsing down in to the areas it's supposed to be.  (We can't do the 
whole string up front because then sed -e "\nblahns/one/two/" wouldn't work, 
and a couple other similar corner cases.)

Made it so "y.abc\.def.ghijkl." should work properly.  And \n in there too.

Audited sed command parsing in general and rationalized the heck out of it.
Moved a couple small functions like parse_translate command inline if they 
were only called from one place anyway.

"One string to rule them all": there used to be 8 gazillion string fields in 
sed_cmd that were never used at the same time.  Now every sed command that 
needs one string argument is using sed_cmd->string, making the structure 
smaller and the free logic simpler.  The only one that couldn't be folded in 
is filename, because then you couldn't support "s/one/two/w filename", 
because "two" is using sed_cmd->string alrady.)

- I added match count support ( s/from/to/3 ).  This ate the s///g command, 
which is now indicated by match count=0.  (The default is 1, for first 
match.)

- Rather than allocate and free a regexp buffer on every search, I just made a 
global that's the biggest size it can ever be in our code (10 entries, 
sizeof(regmatch_t) is 8 bytes, so 80 bytes total.  Those 10 entries are the 
global match's start and end, plus backrefs 1-9.  And that's not going to get 
bigger until there are more digit keys on the keyboard or we start allowing 
\10 as a backref, which really doesn't seem worth it.  Implementation note: 
\0 is a synonym for &.  I can live with that.)  note: This may actually allow 
us to rip out all the logic for counting and tracking the number of backrefs 
in the sed_cmd (since we no longer need it for memory allocation and can tell 
at runtime that the match start is -1), but I haven't actually done that yet.

- I added the first half of s///w support (it now parses the option and stores 
the filename in sed_cmd.  The logic to actually DO it is missing because I 
have to work out when it creates an empty file and when it appends to an 
existing file.  To-do item...)

- pipeline_putc was using in-band signalling (PIPE_MAGIC).  That's just evil.  
I fixed it.

- Lots and lots of little memory leaks.  parse_translate_command wasn't 
freeing match and replace, etc.

- I made append work.  There is now an append buffer.  It gets flushed by the 
next line read, or at end of input.

- moved the line reading and line printing logic into functions which track 
whether the last line read had a newline, so that if it prints out that line 
it doesn't append a newline onto it when there wasn't one in the source file.  
(This is a gnu sed behavior that is a technical violation of the spec, which 
says there will always be a newline.  I don't care, the gnu behavior is 
correct here.  Document the deviation.)  One idiosyncrasy of this is that if 
there are multiple input files (and yes stdin can be one of them, filename 
"-"), a newline is inserted BETWEEN those files.  It's only the actual last 
line of output that can potentially leave the newline off.  The fix?  If we 
left off the newline, add an empty line to the append buffer (this is after 
the EOF flush, so it won't get flushed again unless we read another line, 
I.E. from the next file.)  So the free logic up top that iterated through 
sed_cmd now needs to know to free the append_buffer as well (which is a 
linked list).

Rob

(P.S.  This stuff still makes my brain hurt, but at least the CODE doesn't 
anymore.  That's something.  I _still_ need to finish reading the spec. :)