bug in busybox sed with non-ascii chars

Sun May 4 14:57:41 UTC 2014

On Sun, May 04, 2014 at 04:44:10PM +0200, Denys Vlasenko wrote:
> On Sat, May 3, 2014 at 5:07 PM, Rich Felker <dalias at libc.org> wrote:
> >> Let's refuse to find end of line if there is a non-UTF-8 sequence inside that line?
> >> Sounds wrong to me...
> >
> > sed (also regcomp and regexec) requires text input. Byte streams with
> > illegal sequences are not text. Actually since the regex is not trying
> > to match the illegal sequence, just the end-of-line, it would
> > theoretically be possible to make this work (and it will once we
> > overhaul the regex implementation to work with byte-based DFAs rather
> > than character-based ones), but that doesn't change the fact that it's
> > an invalid test.
> 
> Language lawyering is less important than real world usage.

Indeed it's nice to support additional real-world usage when doing so
does not harm any other usage. But we're not talking about real-world
usage here. We're talking about a buggy configure test.
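
To make "byte streams with illegal sequences are not text" concrete,
here is a minimal sketch of a well-formedness check for UTF-8 input.
This is not musl or busybox code: is_valid_utf8 is a hypothetical
helper written only for illustration, and it assumes a UTF-8 locale
rather than consulting the locale the way regcomp/regexec would.

```c
#include <stddef.h>

/* Hypothetical helper (not part of musl or busybox):
 * returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t n;           /* continuation bytes expected after c */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point valid for this length */

        if (c < 0x80) {     /* plain ASCII, including '\n' */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            n = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            n = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            n = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation byte or invalid lead byte */
        }

        if (len - i - 1 < n)
            return 0;       /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;   /* expected a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates, and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += n + 1;
    }
    return 1;
}
```

A line such as "x\xffz\n" fails this check even though its newline is
an ordinary ASCII byte, which is the situation the configure test
creates.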

I'd love to improve or even rewrite the regex engine but that's a lot
of work and lower priority than a number of other things on the musl
roadmap.

Rich