Cyrillic letters problem

Rob Landley rob at landley.net
Thu Jan 19 20:53:03 UTC 2006


On Wednesday 18 January 2006 17:52, Aurelien Jacobs wrote:
> > What exactly supporting UTF-8 requires above and beyond being 8-bit clean
> > is something I'm still a little unclear on, hence the TODO item when I
> > have time and inclination to learn about it (or somebody else gets
> > inspired).
>
> Just an example of what needs to be done:
> If you feed some UTF-8 strings to the sort command, it can't simply compare
> bytes to do its job. It has to decode the UTF-8 into Unicode characters'
> code points. It can then compare the code points to do its sort.

It would do this how?  (I dunno what the UTF-8 decode/encode/comparison 
thingies would be.  If there's a utf8_strcmp(), that's a possibility...)
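
For illustration, here is a minimal sketch of what such a comparison could look like. Both utf8_decode() and utf8_strcmp() are hypothetical hand-written helpers, not libc or busybox functions, and the sketch assumes NUL-terminated input. Malformed bytes are mapped to U+FFFD rather than rejected, and overlong encodings are not checked.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence starting at s into *cp.
 * Returns the number of bytes consumed (1-4), or 0 at end of string.
 * A malformed or truncated sequence consumes one byte and yields U+FFFD.
 * (Hypothetical helper; overlong encodings are not rejected.)
 */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;
    uint32_t c;

    if (p[0] == 0) { *cp = 0; return 0; }
    if (p[0] < 0x80) { *cp = p[0]; return 1; }

    if ((p[0] & 0xE0) == 0xC0)      { len = 2; c = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; c = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; c = p[0] & 0x07; }
    else { *cp = 0xFFFD; return 1; }  /* stray continuation or invalid lead */

    for (i = 1; i < len; i++) {
        /* A NUL or any non-continuation byte ends the scan early. */
        if ((p[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        c = (c << 6) | (p[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/*
 * Compare two UTF-8 strings by code point.
 * Returns <0, 0 or >0 like strcmp(). (Hypothetical helper.)
 */
int utf8_strcmp(const char *a, const char *b)
{
    for (;;) {
        uint32_t ca, cb;
        size_t na = utf8_decode(a, &ca);
        size_t nb = utf8_decode(b, &cb);

        /* If either string ended, the shorter one sorts first. */
        if (na == 0 || nb == 0) return (na != 0) - (nb != 0);
        if (ca != cb) return ca < cb ? -1 : 1;
        a += na;
        b += nb;
    }
}
```

For well-formed UTF-8, plain strcmp() already produces this same code point ordering, because the encoding preserves code point order under unsigned byte comparison. Decoding becomes necessary for things like case folding or locale-aware collation, where strcoll() from libc is the standard route.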

Glancing at sort.c (since I wrote the one that's in there now)... 

On characters we use isalnum() and toupper(), and we expect key_separator to 
be 8 bits.  On strings we use strcmp(), strtod(), strptime(), strtol(), and 
index().  All of those might conceivably care.  (We also use isspace() and 
isprint() but if UTF-8 cares about that then it's broken.)

That's what I can see.
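
As a sketch of why the per-character calls might care: toupper() sees one byte at a time, so it cannot uppercase a Cyrillic letter that occupies two bytes in UTF-8. The standard C route is to convert to wide characters first. The helper below, mb_toupper(), is a hypothetical name. It assumes the program has already called setlocale(LC_CTYPE, "") and is running under a UTF-8 locale; in the default "C" locale only ASCII letters are guaranteed to be handled.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Uppercase a multibyte string using the current locale's encoding.
 * Writes at most out_size bytes (including the NUL) to out.
 * Returns 0 on success, -1 on invalid input or insufficient space.
 * (Hypothetical helper, not part of libc or busybox.)
 */
int mb_toupper(const char *in, char *out, size_t out_size)
{
    mbstate_t in_state, out_state;
    size_t left = strlen(in);
    size_t used = 0;

    if (out_size == 0)
        return -1;
    memset(&in_state, 0, sizeof in_state);
    memset(&out_state, 0, sizeof out_state);

    while (left > 0) {
        wchar_t wc;
        char buf[MB_LEN_MAX];
        size_t n, m;

        /* Decode the next multibyte character into one wide character. */
        n = mbrtowc(&wc, in, left, &in_state);
        if (n == (size_t)-1 || n == (size_t)-2 || n == 0)
            return -1;

        /* Case-map the wide character, then encode it again. */
        m = wcrtomb(buf, (wchar_t)towupper((wint_t)wc), &out_state);
        if (m == (size_t)-1 || used + m >= out_size)
            return -1;

        memcpy(out + used, buf, m);
        used += m;
        in += n;
        left -= n;
    }
    out[used] = '\0';
    return 0;
}
```

This leans entirely on the libc's locale support, which is the "inherit it for free" case; a libc built without wide character locale data would leave non-ASCII letters unchanged or reject them.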

> There's probably plenty of other things to modify for UTF-8.

Well, busybox is never going to have full internationalization for multiple 
locales (unless we inherit it for free from our libc).  If any of the UTF-8 
support things we do add increase the size of the code, we'll have a global 
CONFIG option to support UTF-8.

On the other hand, beyond being 8-bit clean, UTF-8 support is not that high a 
priority.  We've got lots of other things on the todo list that shouldn't 
increase size or complexity of existing code.

> Aurel

Rob
-- 
Steve Ballmer: Innovation!  Inigo Montoya: You keep using that word.
I do not think it means what you think it means.


