How do I (unconditionally) enable unicode support in busybox?

Harald Becker ralda at gmx.de
Mon Aug 11 15:15:21 UTC 2014


Hi James!

>> export LANG=en_US.UTF-8
>> echo -n "$*" | wc -m
>
> Yes! That works with both glibc and uclibc in the chroot and in
> the initrd.  Thank you!

Fine, at least we could solve your major problem!

You may try: LANG=... wc -m (maybe that works too).

> I didn't know about the -m option to wc.  I had ASSumed -c was
> for counting "c"haracters.  My bad.

Nothing wrong. Nobody can ever know everything.
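
To illustrate the difference between -c and -m, here is a minimal
sketch. The sample string is my own, and it assumes the en_US.UTF-8
locale from above is installed and that LC_ALL is not set (LC_ALL would
override LANG). The variable is set for the single wc command only.

```shell
# "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9 (octal \303 \251).
sample=$(printf 'h\303\251')

# -c counts bytes and does not depend on the locale.
bytes=$(printf '%s' "$sample" | wc -c | tr -d ' ')

# -m counts characters; the result depends on the locale and on whether
# this wc build honours it. LANG is set for this one command only.
chars=$(printf '%s' "$sample" | LANG=en_US.UTF-8 wc -m | tr -d ' ')

echo "bytes=$bytes chars=$chars"
```

The byte count is 3 everywhere. If the locale is available and wc
honours it, the character count should be 2; a wc that counts bytes
reports 3 there too.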

> IMO there is still something very strange with sed and unicode

YES! I kept looking into this. It looks like a problem in the regular
expression parser.

s/./x/g

should match every character and replace it with a single x, but in
fact it matches every byte of a UTF-8 character (which is wrong). This
doesn't seem to depend on the setting of LANG (which confused me). It
may be that it only worked when BB was linked with glibc in a fully
functional environment; maybe a UTF-8 aware regex scanner is used then.
We need to look further into this!
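
A small check for this, as a sketch only. The two-character sample
string is my own choice, and the en_US.UTF-8 locale is assumed to be
installed (and LC_ALL unset). It feeds two UTF-8 characters, four bytes
in total, through the substitution above and shows how many x's come
back.

```shell
# "äö" in UTF-8 is four bytes: 0xC3 0xA4 0xC3 0xB6.
input=$(printf '\303\244\303\266')

# Replace everything matched by "." with x and capture the result.
result=$(printf '%s\n' "$input" | LANG=en_US.UTF-8 sed 's/./x/g')

# Two x's: "." matched whole characters. Four x's: it matched bytes.
echo "result=$result"
```

Running the same lines with the BB sed and with another sed on the same
system makes the difference directly visible.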

In addition, in UTF-8 locales other awk implementations display the
number of characters for the length() function. BB awk always displays
the number of bytes. I don't know which is right, I just noticed the
difference. It belongs to the same type of problem as the one reported
by James. Only his initial diagnosis of the cause seems to be not fully
correct.
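
The awk difference can be checked the same way. Again only a sketch,
with my own sample string and assuming en_US.UTF-8 is installed and
LC_ALL is unset. As far as I read it, POSIX describes length as
returning the length in characters.

```shell
# "äö" is two characters but four bytes in UTF-8.
input=$(printf '\303\244\303\266')

# length($0) is the length of the whole input line.
len=$(printf '%s\n' "$input" | LANG=en_US.UTF-8 awk '{ print length($0) }')

# 2 means awk counted characters, 4 means it counted bytes.
echo "awk length=$len"
```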

> it is broken but that it sometimes works.  I can send you my
> glibc version of busybox where the sed always works from my
> command line and sometimes works in the initrd.

Not required, I finally managed to reproduce your problem :)

> Thank you!  I really appreciate your help and your patience.

Sorry that I initially got stuck on the LANG question and forgot to
focus on your major problem.

> I feel like a kid in a candy store with this mailing list
> although I am starting to get a tummy ache.  I've had a blast and
> I've learned a lot but I hope I can tear myself away in order to
> deal with some other pressing things that are on my plate.

Hey, you were right. There is a BB problem, but there is one more
lesson to learn: give the complete picture of what you are doing and in
which environment, else it may be difficult to find the misbehavior and
much too easy to get stuck on a "user error" topic.
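
As a rough sketch of what such a picture could contain for a locale
problem like this one: the helper name report_env is made up for
illustration, and which details matter is my assumption. It does not
detect which libc the binary is linked against, so that still has to be
stated by hand.

```shell
# Print details that are likely to matter for a locale-related report.
report_env() {
    echo "kernel: $(uname -sr)"
    # Running busybox without arguments prints a banner whose first
    # line names the version.
    if command -v busybox >/dev/null 2>&1; then
        echo "busybox: $(busybox 2>&1 | head -n 1)"
    else
        echo "busybox: not found in PATH"
    fi
    # Locale variables, highest precedence first.
    echo "LC_ALL=${LC_ALL-<unset>}"
    echo "LC_CTYPE=${LC_CTYPE-<unset>}"
    echo "LANG=${LANG-<unset>}"
}

report_env
```

Run it once in the chroot and once in the initrd and include both
outputs along with the exact command that fails.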

And again, for the others to note: James was right. There is at least
one UTF-8 related problem in BB sed. I don't know what exactly fails,
but the regex matches wrongly for the uClibc-linked version of BB. The
dot "." should match single characters, not bytes. Does anybody here
know more about this topic?

--
Harald


