How do I (unconditionally) enable unicode support in busybox?
Harald Becker
ralda at gmx.de
Mon Aug 11 15:15:21 UTC 2014
Hi James!
>> export LANG=en_US.UTF-8
>> echo -n "$*" | wc -m
>
> Yes! That works with both glibc and uclibc in the chroot and in
> the initrd. Thank you!
Fine, at least we solved your major problem!
You may also try: LANG=... wc -m (maybe that works too).
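A minimal sketch of that per-command form, to compare byte and character counts. The sample string and the variable names (s, bytes, chars) are only illustrative, and en_US.UTF-8 is assumed to be available on the system. If LC_ALL is set in the environment it takes precedence over LANG, so the result may differ.

```shell
# "ä" and "ß" are written as octal escapes so the snippet itself stays pure ASCII.
s=$(printf 'h\303\244\303\237')    # "häß": 3 characters, 5 bytes in UTF-8

# -c counts bytes and does not depend on the locale.
bytes=$(printf '%s' "$s" | wc -c)

# -m counts characters; LANG is set for this one command only.
# Assumption: the en_US.UTF-8 locale exists, otherwise wc may fall back to bytes.
chars=$(printf '%s' "$s" | LANG=en_US.UTF-8 wc -m)

# $((...)) strips the leading blanks some wc implementations print.
echo "bytes=$((bytes)) chars=$((chars))"
```

If both numbers are equal, the UTF-8 locale was not picked up by wc.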
> I didn't know about the -m option to wc. I had ASSumed -c was
> for counting "c"haracters. My bad.
Nothing wrong. Nobody can ever know everything.
> IMO there is still something very strange with sed and unicode
YES! I did not stop looking into this. It looks like a problem in
the regular expression parser.
s/./x/g
shall match every character and replace it with a single x, but in fact it
matches every byte of a UTF-8 character (which is wrong). And this
doesn't seem to depend on the setting of LANG (which confused me). Is it
possible that it only worked when BB is linked with glibc in a fully
functional environment? Maybe then a UTF-8 aware regex scanner is used.
We need to look further into this!
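A small sketch to check which behaviour a given sed shows. The test string and the variable names (in, out) are my own choice, and en_US.UTF-8 is assumed to be installed. Run it against the sed you want to test, for example by putting the busybox applet first in PATH.

```shell
# "aäb" with the "ä" given as its two UTF-8 bytes (octal escapes).
in=$(printf 'a\303\244b')          # 3 characters, 4 bytes

# Replace whatever "." matches with a single x.
# Assumption: the en_US.UTF-8 locale is available.
out=$(printf '%s\n' "$in" | LANG=en_US.UTF-8 sed 's/./x/g')

case $out in
  xxx)  echo "dot matched characters" ;;
  xxxx) echo "dot matched bytes" ;;
  *)    echo "unexpected result: $out" ;;
esac
```

Three x mean the regex scanner treated the two-byte "ä" as one character, four x mean it matched each byte.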
In addition, in UTF-8 locales other awk implementations display the number
of characters for the length() function. BB awk always displays the number
of bytes. I don't know which is right, I just noticed the difference. It
belongs to the same type of problem as reported by James. Only his initial
diagnosis of the cause seems to be not fully correct.
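The same kind of check for awk, again only a sketch with an illustrative test string and variable names (s, n), assuming en_US.UTF-8 is installed. As far as I read POSIX, length() is described as returning the length in characters, but take that as my reading and not as a verdict.

```shell
# "aäb" with the "ä" given as its two UTF-8 bytes (octal escapes).
s=$(printf 'a\303\244b')           # 3 characters, 4 bytes

# Ask awk for the length of the whole input line.
# Assumption: the en_US.UTF-8 locale is available.
n=$(printf '%s\n' "$s" | LANG=en_US.UTF-8 awk '{ print length($0) }')

echo "awk length: $n"
```

A result of 3 means awk counted characters, 4 means it counted bytes.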
> it is broken but that it sometimes works. I can send you my
> glibc version of busybox where the sed always works from my
> command line and sometimes works in the initrd.
Not required, I finally managed to reproduce your problem :)
> Thank you! I really appreciate your help and your patience.
Sorry that I initially got stuck on the LANG question and forgot to focus on
your major problem.
> I feel like a kid in a candy store with this mailing list
> although I am starting to get a tummy ache. I've had a blast and
> I've learned a lot but I hope I can tear myself away in order to
> deal with some other pressing things that are on my plate.
Hey, you were right. There is a BB problem, but one lesson to take away:
give the complete picture of what you are doing and in which environment,
else it may be difficult to find the misbehaving part and much too
easy to get stuck on a "user error" topic.
And again, for the others to note: James was right. There is at least one
UTF-8 related problem in BB sed. I don't know what exactly fails, but the
regex matches wrongly in the uClibc-linked version of BB. The dot "." shall
match single characters, not bytes. Is there anybody here who knows more
about this topic?
--
Harald