Possible Unicode Problems in Busybox - Collect and Discussion

Rich Felker dalias at libc.org
Thu Aug 14 17:32:12 UTC 2014


On Thu, Aug 14, 2014 at 07:14:52PM +0200, Harald Becker wrote:
> Hi Rich!
> 
> >> You say this, but libbb/unicode.c contains a unicode_strlen calling
> >>this complex mb to wc conversion function to count the number of
> >>characters. Those multi byte functions tend to be highly complex and
> >>slow (don't know if they have gone better). For just UTF-8, things
> >>can be optimized.
> >
> >This depends on your libc.
> 
> ... that is why I added "don't know if gone better" ... it's really
> good that musl is fast at this ... the problem is that BB is more likely
> to be linked with glibc or uClibc ... there the results are not so great
> :(

I think uClibc is pretty fast at this too. It's glibc that's horribly
slow. Rough comparison:

For processing a full string buffer, musl is roughly twice as fast as
uClibc, and uClibc is roughly twice as fast as glibc.

For byte-by-byte processing: musl is roughly 3x as fast as uClibc and
roughly 4x as fast as glibc.

Source: my comparison at http://www.etalabs.net/compare_libcs.html

Presumably you would use a full string operation here (mbstowcs with
null output pointer) for computing length in characters.
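
A minimal sketch of that approach, assuming the caller has already
selected a UTF-8 locale (for example with setlocale(LC_CTYPE, ""));
the name mb_char_count is only illustrative and is not part of busybox:

```c
#include <stdlib.h>

/* Count the characters in the multibyte string s according to the
 * current locale's LC_CTYPE. With a null destination, mbstowcs()
 * converts nothing and ignores the length argument; it returns the
 * number of wide characters the string would need, or (size_t)-1 if
 * it hits an invalid multibyte sequence. */
size_t mb_char_count(const char *s)
{
    return mbstowcs(NULL, s, 0);
}
```

Callers would need to check for (size_t)-1 before using the result as
a length.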

> >>size_t utf8len( const char* s )
> >>{
> >>   size_t n = 0;
> >>   /* count every byte that is not a continuation byte (0x80..0xBF) */
> >>   while (*s)
> >>     if (((unsigned char)*s++ ^ 0x40) < 0xC0)
> >>       n++;
> >>   return n;
> >>}
> >
> >This function is only valid if the string is known to be valid UTF-8.
> 
> Yes, I said it's for UTF-8.

Yes, but there's a difference between "nominally UTF-8" and
"known-valid UTF-8".

> >Otherwise it hides errors, which may or may not be problematic
> >depending on what you're using it for.
> 
> If you know you are using UTF-8, you do not need to check every
> string over and over again; that is pure paranoia. It is robust,
> as it will not run away on anything which is a valid C string.

Well, if the string comes from a source outside of your control, you
need to check it at least once. But you might not want to check and
reject it at the original point of input, e.g. if you want to be able
to preserve arbitrary byte sequences that might not be UTF-8, such as
an argument that's a filename in an invalid encoding which you're
trying to delete or rename to fix. So IMO it makes a lot more sense to
do your checking at the point of treating the string as a sequence of
characters, even if it happens multiple times. The cost is not high if
your implementation is efficient.
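
As a rough illustration of checking at the point of use, here is a
hand-rolled sketch that counts characters and validates in the same
pass. The name utf8len_checked and the (size_t)-1 error convention are
assumptions made for this example, not existing busybox or libc API:

```c
#include <stddef.h>

/* Return the number of code points in s, or (size_t)-1 if s is not
 * well-formed UTF-8 (stray or missing continuation bytes, overlong
 * forms, surrogates, or values above U+10FFFF). */
size_t utf8len_checked(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p) {
        unsigned char b = *p++;
        unsigned long cp;
        int trail, i;

        if (b < 0x80) {                      /* ASCII */
            n++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) { /* 2-byte lead */
            trail = 1; cp = b & 0x1F;
        } else if (b >= 0xE0 && b <= 0xEF) { /* 3-byte lead */
            trail = 2; cp = b & 0x0F;
        } else if (b >= 0xF0 && b <= 0xF4) { /* 4-byte lead */
            trail = 3; cp = b & 0x07;
        } else {
            /* stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF */
            return (size_t)-1;
        }

        for (i = 0; i < trail; i++) {
            /* a NUL terminator fails this test too, so a truncated
             * sequence never reads past the end of the string */
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (*p++ & 0x3F);
        }

        if ((trail == 2 && cp < 0x800) || (trail == 3 && cp < 0x10000))
            return (size_t)-1;               /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return (size_t)-1;               /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return (size_t)-1;               /* beyond Unicode range */
        n++;
    }
    return n;
}
```

This does not touch the original bytes, so a caller can still delete
or rename a badly encoded filename and only refuse to treat it as text.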

> >Of course it also gets tripped up badly on invalid sequences.
> 
> How can it get tripped up? It silently skips over invalid sequences
> (of 0x80 to 0xBF, until the next lead byte of a sequence). It will
> not get stuck in any way. Or tell me exactly how ...

By itself it's not a problem, but the interaction with other code may
be a problem if the other code does not follow exactly the same
conventions.
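
To make that concrete, here is a small sketch with two counting
conventions that agree on valid UTF-8 but disagree on invalid input.
Both function names are hypothetical, and the second convention (one
replacement character per stray continuation byte) is only an assumed
example of what some other piece of code might do. Bytes are read as
unsigned char so the results do not depend on whether plain char is
signed:

```c
#include <stddef.h>

/* Convention A: count every byte that is not a continuation byte
 * (0x80..0xBF). Stray continuation bytes are silently skipped. */
size_t count_skip_stray(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    for (; *p; p++)
        if ((*p & 0xC0) != 0x80)
            n++;
    return n;
}

/* Convention B: a lead byte absorbs the continuation bytes that
 * follow it, up to the number it announces; any other continuation
 * byte counts as one item, as if shown as a replacement character. */
size_t count_replace_stray(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p) {
        unsigned char b = *p++;
        int trail = b >= 0xF0 ? 3 : b >= 0xE0 ? 2 : b >= 0xC0 ? 1 : 0;

        n++; /* ASCII, a lead byte, or a stray continuation byte */
        while (trail-- > 0 && (*p & 0xC0) == 0x80)
            p++;
    }
    return n;
}
```

For the bytes 'a', 0x80, 'z' the first function returns 2 and the
second returns 3, so a buffer sized with one and filled according to
the other would be one element too short.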

Rich

