Possible Unicode Problems in Busybox - Collect and Discussion
Harald Becker
ralda at gmx.de
Thu Aug 14 17:14:52 UTC 2014
Hi Rich!
>> You say this, but libbb/unicode.c contains a unicode_strlen calling
>> this complex mb to wc conversion function to count the number of
>> characters. Those multi byte functions tend to be highly complex and
>> slow (don't know if they have gone better). For just UTF-8, things
>> can be optimized.
>
> This depends on your libc.
... that is why I added "don't know if they have gone better" ... really
good if musl is fast on this ... the problem is that BB is more likely to
be linked with glibc or uClibc ... there the results are not so great :(
>> size_t utf8len( const char* s )
>> {
>>     size_t n = 0;
>>     while (*s)
>>         if ((*s++ ^ 0x40) < 0xC0)
>>             n++;
>>     return n;
>> }
>
> This function is only valid if the string is known to be valid UTF-8.
Yes, I said it is for UTF-8.
> Otherwise it hides errors, which may or may not be problematic
> depending on what you're using it for.
If you know you are using UTF-8, you do not need to check every string
over and over again; anything more is pure paranoia. It is robust in that
it will not run past the end of anything that is a valid C string.
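A minimal sketch of the "check once" idea: validate a string at the point
where it enters the program, so that the fast helpers can afterwards assume
well-formed UTF-8. The name utf8_valid is hypothetical and is not an
existing BusyBox function; the rules applied (no overlong forms, no
surrogates, nothing above U+10FFFF) are those of RFC 3629.

```c
#include <stddef.h>

/* Hypothetical helper, not part of BusyBox.
 * Returns 1 if s is well-formed UTF-8, 0 otherwise. */
int utf8_valid(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        unsigned char c = *p++;
        unsigned long cp, min;
        int extra;

        if (c < 0x80) {
            continue;                       /* plain ASCII */
        } else if (c >= 0xC2 && c <= 0xDF) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if (c >= 0xF0 && c <= 0xF4) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;   /* stray continuation byte or invalid lead byte */
        }

        while (extra--) {
            /* A NUL is not a continuation byte, so a truncated
             * sequence is rejected here without reading past the end. */
            if ((*p & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (*p++ & 0x3F);
        }

        /* Reject overlong encodings, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }
    return 1;
}
```

With a check like this done once on input, later calls that only count or
skip characters do not need to re-validate the same string.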
>> Another fast function I use for UTF-8 ... skip to Nth UTF-8
>> character in a string (returns a pointer to trailing \0 if N >
>> number of UTF-8 chars in string):
>>
>> char *utf8skip( char const* s, size_t n )
>> {
>>     for ( ; n && *s; --n )
>>         while ((*++s ^ 0x40) >= 0xC0);
>>     return (char*)s;
>> }
>
> This code is invalid; it's assuming char is unsigned. In practice,
> *++s ^ 0x40 is going to be negative on most archs. Better would be
> doing an unsigned range check like (unsigned char)*++s-0x80<0x40U.
Yes, I missed the type cast ... sorry for this, see my previous mail.
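For reference, a sketch of both helpers with each byte read as unsigned
char, so that the result no longer depends on whether plain char is signed.
utf8skip uses the unsigned range check suggested in the quoted reply. Both
still assume that the input is valid UTF-8.

```c
#include <stddef.h>

/* Count characters by counting every byte that is not a
 * continuation byte (0x80..0xBF). */
size_t utf8len(const char *s)
{
    size_t n = 0;

    while (*s) {
        /* After the cast the XOR result lies in 0..255; continuation
         * bytes map to 0xC0..0xFF and are not counted. */
        if (((unsigned char)*s++ ^ 0x40) < 0xC0)
            n++;
    }
    return n;
}

/* Skip n characters; returns a pointer to the trailing NUL if the
 * string holds fewer than n characters. */
char *utf8skip(const char *s, size_t n)
{
    for ( ; n && *s; --n) {
        /* Step past the current byte, then past any continuation bytes.
         * For bytes below 0x80 the subtraction is negative and converts
         * to a large unsigned value, so the test fails as intended. */
        do {
            ++s;
        } while ((unsigned char)*s - 0x80 < 0x40U);
    }
    return (char *)s;
}
```

Neither loop reads past the terminating NUL, whatever the input: utf8len
stops at the NUL, and in utf8skip a NUL is never treated as a continuation
byte.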
> Of course it also gets tripped up badly on invalid sequences.
How can it get tripped up? It silently skips over invalid sequences (bytes
0x80 to 0xBF, until the next lead byte of a sequence). It should not get
stuck in any way. Or tell me exactly how ...
--
Harald