bug in busybox sed with non-ascii chars

Rich Felker dalias at libc.org
Wed May 7 21:20:06 UTC 2014


On Mon, May 05, 2014 at 08:08:32PM +0100, Sam Liddicott wrote:
> One of the advantages of utf-8 encoding was that it was easy to re-sync
> after an invalid sequence.
> 
> It's a bit of a waste to then not do that. Minus points for musl.

An application can resync, although the C multibyte interfaces are not
really designed to be used this way (and you have to be careful if the
locale's encoding might be state-dependent, e.g. some legacy CJK
encodings). However, the implementation cannot silently resync behind
your back. Doing so introduces serious bugs, some of which may be
security-relevant: when input is processed via conversion to wide
characters, either some input bytes are silently dropped, or some
invalid sequences appear to the application as valid. Either
possibility is dangerous. In particular, it's wrong for the regex "."
to match anything that's an illegal sequence, and therefore wrong for
"^.*$" to match a line containing any illegal sequences (since no "."
can match them).
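As a rough sketch of what resyncing at the application level could look
like (assuming a stateless encoding such as UTF-8; the helper name
count_chars_resync is made up for illustration and is not part of any
library), the caller checks for the error return from mbrtowc(), resets
the conversion state, and decides for itself what to do with the
offending byte:

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: counts the characters decoded from buf[0..len)
 * in the current locale and reports, via *invalid, how many bytes could
 * not be decoded. Assumes a stateless encoding (e.g. UTF-8). */
size_t count_chars_resync(const char *buf, size_t len, size_t *invalid)
{
    mbstate_t st;
    size_t pos = 0, chars = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    *invalid = 0;

    while (pos < len) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, buf + pos, len - pos, &st);

        if (n == (size_t)-1) {
            /* Illegal sequence: the state is now unspecified, so reset
             * it, record the bad byte, and retry at the next byte. */
            memset(&st, 0, sizeof st);
            ++*invalid;
            ++pos;
        } else if (n == (size_t)-2) {
            /* Incomplete character at the end of the buffer. */
            *invalid += len - pos;
            break;
        } else {
            /* n == 0 means a null character was decoded; with a
             * stateless encoding it occupies exactly one byte. */
            pos += n ? n : 1;
            ++chars;
        }
    }
    return chars;
}
```

The point is that the skipped bytes are counted and visible to the
caller, which can then reject the input, replace them, or pass them
through; nothing is hidden from it. The result for non-ASCII input
depends on the locale selected with setlocale().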

> Can you not run sed with LANG=C or LANG=POSIX?

That's not what they're doing, but it's not a solution anyway. ISO C
leaves the character encoding of the C locale implementation-defined,
and the Rationale text from the 1995 amendments to C explicitly allows
for the possibility that the C locale's character encoding has
multibyte characters (e.g. is UTF-8).
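
Since this is implementation-defined, one way to find out what a given
libc does is to probe it. The following is only a sketch; the function
name c_locale_accepts_high_byte is invented for this example, and it
tests a single byte rather than the whole range:

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical probe: returns 1 if the "C" locale converts the lone
 * byte 0xE9 as one complete character, 0 if it is rejected or treated
 * as the start of a longer multibyte character.
 * Note: this changes the process-wide LC_CTYPE setting. */
int c_locale_accepts_high_byte(void)
{
    const char byte = (char)0xE9;   /* not valid UTF-8 on its own */
    mbstate_t st;
    wchar_t wc;
    size_t n;

    if (!setlocale(LC_CTYPE, "C"))
        return 0;

    memset(&st, 0, sizeof st);      /* initial conversion state */
    n = mbrtowc(&wc, &byte, 1, &st);

    /* (size_t)-1: illegal sequence; (size_t)-2: incomplete character. */
    return n == 1;
}
```

A return value of 0 means that running with LANG=C on that system will
not make arbitrary bytes acceptable to the multibyte functions.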

musl presently does not support byte-based characters at all, only
UTF-8. This conforms to the current versions of ISO C and POSIX, but
the Austin Group has adopted, for a future version of the standard, a
requirement that the C locale be "8-bit clean", which musl will
probably support at some point.

Rich

