Possible Unicode Problems in Busybox - Collect and Discussion

Fri Aug 15 17:38:43 UTC 2014

On Fri, Aug 15, 2014 at 12:31:15AM +0200, Harald Becker wrote:
> >> .... and how do you want to behave in case of invalid UTF-8 sequences? My
> >>functions just skip over stray codes of 0x80..0xBF and synchronize
> >>on next valid UTF-8 leading byte. How would you count invalid
> >>sequences?
> >
> >In general, I would count the whole operation as a failure, returning
> >some value such as -1 reserved for failure, since the string is not
> >actually UTF-8 and thus "how many characters?" has no meaning. For
> >specific uses, there might be other preferred behaviors. If your goal
> >is display, you may want to simply replace illegal sequences with
> >U+FFFD in which case you'd count each such sequence as "1", but if
> >you're using this character-counting to allocate a buffer for the
> >converted string, you need to be sure your conversion function and
> >character-counting function agree on how illegal sequences are
> >counted, or you might overflow your buffer or end up having to
> >truncate the output.
> 
> Rich, will you ever use the result of counting the number of UTF-8
> characters to allocate a buffer? I don't think so. That would be
> very ill behavior. To allocate buffer space you need the number of
> bytes occupied by a string, not the number of UTF-8 characters.

If your intent is to convert the string to UTF-32/wchar_t/whatever,
then yes, you use the result for allocating a buffer. In my mind
that's the main point of counting characters (since otherwise you
usually care about either bytes, for storage, or columns, for
presentation), and while I personally consider it better to work
character-at-a-time and keep the string as UTF-8, some APIs require a
string in a different format, especially ones that work with a whole
string and prepare it for visual presentation.
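
A minimal sketch of that allocation use case, assuming a strict policy
where any illegal sequence fails the whole operation with -1. The names
utf8_decode, utf8_count_strict and utf8_to_utf32 are hypothetical and
are not busybox or libc functions. The counting and the conversion share
one decoder, so they cannot disagree about how a sequence is counted.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Decode one code point from a NUL-terminated string.
 * Returns the number of bytes consumed (1..4), or 0 if the sequence
 * is not valid UTF-8. The caller must not pass an empty string. */
static int utf8_decode(const unsigned char *s, uint32_t *cp)
{
    uint32_t c = s[0], min;
    int n, i;

    if (c < 0x80) {
        *cp = c;
        return 1;
    } else if (c >= 0xC2 && c <= 0xDF) {
        n = 2; c &= 0x1F; min = 0x80;
    } else if ((c & 0xF0) == 0xE0) {
        n = 3; c &= 0x0F; min = 0x800;
    } else if (c >= 0xF0 && c <= 0xF4) {
        n = 4; c &= 0x07; min = 0x10000;
    } else {
        return 0; /* stray continuation byte or invalid lead byte */
    }
    for (i = 1; i < n; i++) {
        /* A NUL fails this test too, so we never read past the end. */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values above U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return n;
}

/* Number of characters in str, or -1 if str is not valid UTF-8. */
long utf8_count_strict(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    long count = 0;
    uint32_t cp;

    while (*s) {
        int n = utf8_decode(s, &cp);
        if (n == 0)
            return -1;
        s += n;
        count++;
    }
    return count;
}

/* Convert to a freshly allocated, 0-terminated UTF-32 array.
 * Returns NULL on invalid input or allocation failure. */
uint32_t *utf8_to_utf32(const char *str, long *len_out)
{
    const unsigned char *s = (const unsigned char *)str;
    long i, count = utf8_count_strict(str);
    uint32_t *out;

    if (count < 0)
        return NULL;
    out = malloc(((size_t)count + 1) * sizeof *out);
    if (!out)
        return NULL;
    for (i = 0; i < count; i++)
        s += utf8_decode(s, &out[i]); /* already validated above */
    out[count] = 0;
    if (len_out)
        *len_out = count;
    return out;
}
```

A display-oriented variant would instead substitute U+FFFD and count
each illegal sequence as 1, but then the converter has to make exactly
the same substitution.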

The main other place counting characters makes sense is for
implementing languages that do substring operations with character
indexes, which I think is the one you care about.
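
For the substring case, a rough sketch of mapping a character index to
a byte offset. This one assumes the lenient policy described in the
quoted text at the top (skip stray 0x80..0xBF bytes and resynchronize on
the next non-continuation byte). The name utf8_offset is hypothetical.
It does not validate lead bytes or sequence lengths; it only counts
bytes that are not continuation bytes.

```c
#include <stddef.h>

/* Byte offset of the character with index idx in the NUL-terminated
 * string s. Every byte that is not a continuation byte (0x80..0xBF)
 * starts a new character; stray continuation bytes are skipped.
 * If idx is past the end, the offset of the terminating NUL is
 * returned. */
size_t utf8_offset(const char *s, size_t idx)
{
    size_t off = 0;

    /* Skip stray continuation bytes at the very start. */
    while (((unsigned char)s[off] & 0xC0) == 0x80)
        off++;
    while (s[off] && idx > 0) {
        off++; /* step over the lead byte */
        /* Step over its continuation bytes, and any stray ones. */
        while (((unsigned char)s[off] & 0xC0) == 0x80)
            off++;
        idx--;
    }
    return off;
}
```

A substring from character a to character b is then the byte range
from utf8_offset(s, a) up to, but not including, utf8_offset(s, b).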

> So the big question is: Is there anybody who still needs the BB
> internal Unicode handling and can't use the locale functions of a
> libc? Why and for what purpose is this needed? In which environment?

I think the intent was to let uClibc users (and possibly eglibc
users?) omit locale support from the libc, which reduces libc size
quite a bit, and use the UTF-8 code in busybox instead.

> As far as I know, those BB internal functions originated at a time
> when only glibc had locale support and there were
> no alternatives for small environments. But things changed and there
> are now alternatives. So have we reached a point, where we are able
> to simplify things in BB (which means to focus on correct mb
> function usage everywhere and to strip unnecessary decisions,
> configs and helper code)?

I wouldn't object to this change.

Rich