Possible Unicode Problems in Busybox - Collect and Discussion

Harald Becker ralda at gmx.de
Thu Aug 14 17:14:52 UTC 2014


Hi Rich!

>> You say this, but libbb/unicode.c contains a unicode_strlen calling
>> this complex mb to wc conversion function to count the number of
>> characters. Those multi byte functions tend to be highly complex and
>> slow (don't know if they have gone better). For just UTF-8, things
>> can be optimized.
>
> This depends on your libc.

... that is why I added "don't know if they have gone better". It is
really good if musl is fast on this, but the problem is that BB is more
likely to be linked with glibc or uClibc, and there the results are not
so great :(

>> size_t utf8len( const char* s )
>> {
>>    size_t n = 0;
>>    while (*s)
>>      if ((*s++ ^ 0x40) < 0xC0)
>>        n++;
>>    return n;
>> }
>
> This function is only valid if the string is known to be valid UTF-8.

Yes, I said it is for UTF-8.

> Otherwise it hides errors, which may or may not be problematic
> depending on what you're using it for.

If you know you are using UTF-8, you do not need to check every string
over and over again; doing so anyway is pure paranoia. The function is
robust, as it will not run past the end of anything that is a valid C
string.
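
To make the counting rule concrete, here is a minimal sketch of the
function above with an explicit unsigned cast, so that it does not depend
on whether plain char is signed. The name utf8len_fixed is hypothetical
(chosen only for this illustration), and the sketch still assumes the
input is valid UTF-8:

```c
#include <stddef.h>

/* Count UTF-8 characters by counting every byte that is NOT a
 * continuation byte (0x80..0xBF). XOR with 0x40 maps continuation
 * bytes to 0xC0..0xFF and every other byte below 0xC0.
 * Assumption: s is a NUL-terminated, valid UTF-8 string. */
size_t utf8len_fixed(const char *s)
{
    size_t n = 0;
    while (*s)
        if (((unsigned char)*s++ ^ 0x40) < 0xC0)
            n++;
    return n;
}
```

The loop stops at the first NUL byte, so it reads no further than strlen
would on the same string.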

>> Another fast function I use for UTF-8 ... skip to Nth UTF-8
>> character in a string (returns a pointer to trailing \0 if N >
>> number of UTF-8 chars in string):
>>
>> char *utf8skip( char const* s, size_t n )
>> {
>>    for ( ; n && *s; --n )
>>      while ((*++s ^ 0x40) >= 0xC0);
>>    return (char*)s;
>> }
>
> This code is invalid; it's assuming char is unsigned. In practice,
> *++s ^ 0x40 is going to be negative on most archs. Better would be
> doing an unsigned range check like (unsigned char)*++s-0x80<0x40U.

Yes, I missed the type cast ... sorry for this, see my previous mail.
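
For illustration, a sketch of the skip function using the unsigned range
check suggested above. The name utf8skip_fixed is hypothetical, and this
is not necessarily the exact form given in the previous mail:

```c
#include <stddef.h>

/* Skip n UTF-8 characters; returns a pointer to the terminating NUL
 * if the string holds fewer than n characters.
 * (unsigned char)c - 0x80 < 0x40U is true only for continuation
 * bytes 0x80..0xBF: smaller values wrap to large unsigned numbers.
 * Assumption: s is a NUL-terminated, valid UTF-8 string. */
char *utf8skip_fixed(const char *s, size_t n)
{
    for (; n && *s; --n)
        while ((unsigned char)*++s - 0x80 < 0x40U)
            ;
    return (char *)s;
}
```

With the string "a", "ä" (0xC3 0xA4), "c", skipping two characters
yields a pointer three bytes in, at the "c".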

> Of course it also gets tripped up badly on invalid sequences.

How can it get tripped up? It silently skips over invalid sequences
(bytes 0x80 to 0xBF, until the next lead byte of a sequence). It cannot
get stuck in any way. Or tell me exactly how ...
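
To have something concrete to discuss, here is a small sketch that
reports how many bytes a single skip step consumes under the same rule.
The helper utf8_step_len is hypothetical and exists only for this
illustration:

```c
#include <stddef.h>

/* Number of bytes one skip step consumes: advance one byte, then
 * swallow any continuation bytes (0x80..0xBF) that follow.
 * Returns 0 when s already points at the terminating NUL. */
size_t utf8_step_len(const char *s)
{
    const char *p = s;
    if (*p)
        while ((unsigned char)*++p - 0x80 < 0x40U)
            ;
    return (size_t)(p - s);
}
```

On malformed input the step always terminates at or before the NUL, but
it accepts the bytes without complaint: two stray continuation bytes
followed by "a" are consumed as one 2-byte "character", and a lone lead
byte 0xE2 followed by "a" is consumed as a 1-byte "character". Whether
that counts as getting tripped up depends on what the caller does with
the result.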

--
Harald



