bug in busybox sed with non-ascii chars

Sam Liddicott sam at liddicott.com
Mon May 5 19:08:32 UTC 2014


One of the advantages of the UTF-8 encoding is that it is easy to re-sync
after an invalid sequence.

It's a bit of a waste to then not do that. Minus points for musl.
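
To illustrate the re-sync property: continuation bytes always look like
10xxxxxx, so after a bad sequence a decoder can skip forward to the next
byte that is not a continuation byte and carry on. Below is a minimal
sketch in Python. The names resync and decode_lossy are made up for this
example, and this is not how musl or busybox do it, just a demonstration
that the newline after a bad byte is still found.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def resync(data: bytes, pos: int) -> int:
    """Return the first index >= pos that could start a new character.

    Hypothetical helper: it only skips continuation bytes, which is all
    that is needed to get back onto a character boundary.
    """
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


def decode_lossy(data: bytes) -> str:
    """Decode UTF-8, replacing each invalid sequence with U+FFFD.

    Hypothetical helper: tries sequence lengths 1..4 at each position and
    re-syncs when none of them decodes.
    """
    out = []
    pos = 0
    while pos < len(data):
        for length in range(1, 5):
            try:
                out.append(data[pos:pos + length].decode("utf-8"))
                pos += length
                break
            except UnicodeDecodeError:
                continue
        else:
            # No valid character starts here: emit a replacement character,
            # drop the offending byte, and skip to the next possible start.
            out.append("\ufffd")
            pos = resync(data, pos + 1)
    return "".join(out)


if __name__ == "__main__":
    line = b"ab\xffcd\n"  # 0xFF can never appear in valid UTF-8
    print(repr(decode_lossy(line)))
```

The point is that the bad byte costs one replacement character and nothing
more: the text after it, including the end of line, decodes normally.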

Can you not run sed with LANG=C or LANG=POSIX?

Sam

On 4 May 2014 15:57, "Rich Felker" <dalias at libc.org> wrote:

> On Sun, May 04, 2014 at 04:44:10PM +0200, Denys Vlasenko wrote:
> > On Sat, May 3, 2014 at 5:07 PM, Rich Felker <dalias at libc.org> wrote:
> > >> Let's refuse to find end of line if there is a non UTF-8 sequence
> > >> inside that line?
> > >> Sounds wrong to me...
> > >
> > > sed (also regcomp and regexec) requires text input. Byte streams with
> > > illegal sequences are not text. Actually since the regex is not trying
> > > to match the illegal sequence, just the end-of-line, it would
> > > theoretically be possible to make this work (and it will once we
> > > overhaul the regex implementation to work with byte-based DFAs rather
> > > than character-based ones), but that doesn't change the fact that it's
> > > an invalid test.
> >
> > Language lawyering is less important than real world usage.
>
> Indeed it's nice to support additional real-world usage when doing so
> does not harm any other usage. But we're not talking about real-world
> usage here. We're talking about a buggy configure test.
>
> I'd love to improve or even rewrite the regex engine but that's a lot
> of work and lower priority than a number of other things on the musl
> roadmap.
>
> Rich