Possible Unicode Problems in Busybox - Collection and Discussion

Rich Felker dalias at libc.org
Thu Aug 14 06:31:02 UTC 2014


On Wed, Aug 13, 2014 at 07:06:38PM +0200, Harald Becker wrote:
> Hi Denys !
> 
> > The world seems to be standardizing on utf-8.
> >Thank God, supporting gazillion of encodings is no fun.
> 
> You say this, but libbb/unicode.c contains a unicode_strlen that
> calls a complex mb-to-wc conversion function to count the number of
> characters. Those multibyte functions tend to be highly complex and
> slow (I don't know whether they have improved). For plain UTF-8,
> this can be optimized.

This depends on your libc. In musl, the only slow thing about them is
having to account for some idiotic special cases the standard allows
(special meanings for null pointers in each of the arguments), and even
that should not be slow on machines with proper branch prediction.
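For reference, counting characters through the standard multibyte API
Rich is describing might look like the sketch below. The name
mb_strlen and the exact error handling are mine, not from the thread;
non-ASCII input additionally requires a UTF-8 locale to be set via
setlocale().

```c
#include <string.h>
#include <wchar.h>

/* Count characters in s using the standard mbrtowc() interface.
 * Returns (size_t)-1 on an invalid or incomplete sequence. */
static size_t mb_strlen(const char *s)
{
    mbstate_t st;
    size_t n = 0, len = strlen(s), r;
    memset(&st, 0, sizeof st);
    while (len) {
        r = mbrtowc(NULL, s, len, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;   /* invalid or truncated sequence */
        if (r == 0)              /* embedded nul; cannot happen here */
            break;
        s += r;
        len -= r;
        n++;
    }
    return n;
}
```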

> e.g.
> 
> size_t utf8len( const char* s )
> {
>   size_t n = 0;
>   while (*s)
>     if ((*s++ ^ 0x40) < 0xC0)
>       n++;
>   return n;
> }

This function is only valid if the string is known to be valid UTF-8.
Otherwise it hides errors, which may or may not be problematic
depending on what you're using it for.
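One way to avoid hiding errors, while keeping a simple byte loop, is
to validate the sequence structure as you count. This is a sketch of
my own, not code from the thread; it checks lead/continuation byte
structure only and does not reject overlong encodings or surrogates.

```c
#include <stddef.h>

/* Count UTF-8 characters in s; return (size_t)-1 on a malformed
 * sequence instead of silently miscounting. */
static size_t utf8len_checked(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    while (*p) {
        int need;                        /* continuation bytes expected */
        if (*p < 0x80)
            need = 0;
        else if ((*p & 0xE0) == 0xC0)
            need = 1;
        else if ((*p & 0xF0) == 0xE0)
            need = 2;
        else if ((*p & 0xF8) == 0xF0)
            need = 3;
        else
            return (size_t)-1;           /* stray continuation/bad lead */
        p++;
        while (need--) {
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;       /* missing continuation byte */
            p++;
        }
        n++;
    }
    return n;
}
```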

> Another fast function I use for UTF-8 ... skip to Nth UTF-8
> character in a string (returns a pointer to trailing \0 if N >
> number of UTF-8 chars in string):
> 
> char *utf8skip( char const* s, size_t n )
> {
>   for ( ; n && *s; --n )
>     while ((*++s ^ 0x40) >= 0xC0);
>   return (char*)s;
> }

This code is invalid; it assumes plain char is unsigned. On most
archs char is signed, so *++s ^ 0x40 comes out negative for
continuation bytes and the >= 0xC0 test never matches them. Better
would be an unsigned range check like (unsigned char)*++s-0x80<0x40U.

Of course it also gets tripped up badly on invalid sequences.
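Applying that unsigned range check, a signedness-safe rewrite of the
skip function might look like the following sketch. Like the
original, it still assumes the input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Skip to the Nth UTF-8 character; returns a pointer to the trailing
 * nul if n exceeds the number of characters. The unsigned range
 * check works whether plain char is signed or not. */
static char *utf8skip(const char *s, size_t n)
{
    for (; n && *s; --n)
        while ((unsigned char)*++s - 0x80 < 0x40U)
            ;                    /* step over continuation bytes */
    return (char *)s;
}
```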

Rich
