How do I (unconditionally) enable unicode support in busybox?
James Bowlin
bitjam at gmail.com
Mon Aug 11 01:46:41 UTC 2014
On Sun, Aug 10, 2014 at 11:12 PM, Harald Becker said:
> How shall your final BB manage UTF if you disable locale stuff in the
> lib.
I created a tr hack that gives the length of a unicode string even if
there is no unicode support. All I need is to get the length of
unicode strings in characters, not bytes. Here is the hack:
TR='\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217'
TR=$TR'\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237'
TR=$TR'\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257'
TR=$TR'\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277'
str_len() {
    echo -n "$*" | tr -d "$TR" | wc -c
}
This always works. If you look at the table here:
https://en.wikipedia.org/wiki/Utf-8#Description
you will see that my tr expression removes every continuation
byte (byte 2 through byte 6 of a sequence), so "wc -c" counts one
byte per character instead of all the bytes. The $TR variable
contains all 64 byte values whose top two bits are binary "10".
The numbers in $TR are in octal.
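The same list can be generated instead of typed out. The sketch
below is only an illustration: build_tr and utf8_len are made-up
names, it assumes a printf that accepts %o, and it forces
LC_ALL=C on tr so that tr works on raw bytes.

```shell
#!/bin/sh
# Build the list of UTF-8 continuation bytes (octal 200 through 277,
# i.e. every byte of the form 10xxxxxx) as tr-style octal escapes.
build_tr() {
    i=128                       # 0x80, octal 200
    list=
    while [ "$i" -le 191 ]; do  # 0xBF, octal 277
        list=$list$(printf '\\%03o' "$i")
        i=$((i + 1))
    done
    printf '%s' "$list"
}

TR=$(build_tr)

# Count characters by deleting continuation bytes and counting what
# is left. utf8_len is a hypothetical name doing the job of str_len.
utf8_len() {
    printf '%s' "$*" | LC_ALL=C tr -d "$TR" | wc -c
}

# Two Greek letters (2 bytes each) plus one ASCII letter:
# 5 bytes, 3 characters.
sample=$(printf '\316\232\316\261x')
utf8_len "$sample"
```

Only the lead byte of each multi-byte sequence and the plain ASCII
bytes survive the tr step, which is why the remaining byte count
equals the character count.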
The following usually does not work unless I am on the command
line or I use "lang=xxxx" as a boot parameter in the initrd:
echo -n "$*" | sed 's/./x/g' | wc -c
No amount of exporting or exec'ing has helped. The only way I
have been able to get this to work in the initrd is with the
"lang=xxxx" boot parameter. I realize this may be difficult to
believe. My test program merely tries various ways to get the
length of a unicode string, using a Greek string as the test
input. Here is the string I've been using:
x="Καλώς ήρθατε στο"
It has 30 bytes and 16 unicode characters. I don't know if it
will survive the email system. I could give it to you in hex or
octal but I don't know how to convert that back into unicode.
Ah, xxd does the trick. Set:
y="0000000: ce9a ceb1 cebb cf8e cf82 20ce aecf 81ce
0000010: b8ce b1cf 84ce b520 cf83 cf84 cebf"
and then run:
echo "$y" | xxd -r
The embedded newline in $y is required.
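If xxd is not available, printf with octal escapes can rebuild
the same 30 bytes. This is only a sketch: the escapes were
transcribed by hand from the hex dump above, and it relies on the
\ddd octal escapes that a POSIX printf accepts in its format
string.

```shell
#!/bin/sh
# Rebuild the Greek test string from octal escapes, one \ddd per byte.
# \040 is the ASCII space between the words.
x=$(printf '\316\232\316\261\316\273\317\216\317\202\040\316\256\317\201\316\270\316\261\317\204\316\265\040\317\203\317\204\316\277')

# Byte count of the rebuilt string.
printf '%s' "$x" | wc -c
```

Because the escapes are pure ASCII, this form should survive any
mail system or editor that might mangle the Greek text itself.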
> Looks like you are building your BB the wrong way.
Could well be. I just don't know what changes I can make to fix
it. I am pursuing this mostly because it seems to be a problem
in BB, (even if that problem is non-obvious config parameters
needed to get the unicode support I want to work reliably). ISTM
there should be an easy and obvious way to unconditionally enable
unicode support. It is only because I can get it to work
sometimes (even with no locale files) and because there seems to
be code for it in unicode.c that I have pursued this as far as I
have. ISTM that at worst, exporting LANG=en_US.UTF-8 (or even
LANG=utf, according to unicode.c) should enable unicode support,
especially if it is combined with exec. But this *never* works
in the chroot or in the initrd. Using the boot parameter
"lang=en_US.UTF-8" makes it work in the initrd although this is
fragile.
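A quick way to see which mode a given environment ends up in is
to compare what sed counts with what the byte-stripping trick
counts. The sketch below is only illustrative: sed_len, tr_len
and unicode_active are made-up names, and it assumes that a sed
without working unicode support treats every byte as a character.

```shell
#!/bin/sh
# All UTF-8 continuation bytes (octal 200-277) as tr-style escapes.
CONT=
i=128
while [ "$i" -le 191 ]; do
    CONT=$CONT$(printf '\\%03o' "$i")
    i=$((i + 1))
done

# Length as seen by sed: depends on whether "." matches a whole
# character or a single byte. Any newline sed adds is dropped.
sed_len() {
    printf '%s' "$*" | sed 's/./x/g' | tr -d '\n' | wc -c
}

# Length in characters, independent of locale support.
tr_len() {
    printf '%s' "$*" | LC_ALL=C tr -d "$CONT" | wc -c
}

# Prints "yes" if sed counted characters, "no" otherwise.
unicode_active() {
    if [ "$(sed_len "$*")" -eq "$(tr_len "$*")" ]; then
        echo yes
    else
        echo no
    fi
}

# Test input: one 2-byte Greek letter followed by "a"
# (3 bytes, 2 characters).
s=$(printf '\316\232a')
echo "current environment:   $(unicode_active "$s")"
echo "after exporting LANG:  $(export LANG=en_US.UTF-8; unicode_active "$s")"
```

The second call runs inside a subshell, so the exported LANG does
not leak into the rest of the script. If LC_ALL is already set it
takes precedence over LANG.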
I am dubious about using a pre-built busybox because I am picky
about the config. There are some less often used commands that I
need and there are many often used commands that I don't need and
don't want.
My system boots with the new BB I made with uclibc and it
contains the selection of commands I want. I haven't tested every
feature but given all of my previous experience I don't expect
any problems. I've been using this BB config for years with only
small variations to include new commands as I need them. The
only problem I've had with it is enabling unicode support.
Buildroot exposes a very limited set of uclibc and binutils
options:
*** uClibc Options ***
uClibc C library Version (uClibc 0.9.32.x) --->
(package/uclibc/uClibc-0.9.32.config) uClibc configuration file to use?
[*] Enable large file (files > 2 GB) support
[ ] Enable IPv6 support
[ ] Enable RPC support
[*] Enable WCHAR support
[ ] Enable toolchain locale/i18n support
Thread library implementation (linuxthreads (stable/old)) --->
[ ] Thread library debugging
[ ] Enable stack protection support
[ ] Compile and install uClibc utilities
[ ] Compile and install uClibc tests
*** Binutils Options ***
Binutils Version (binutils 2.22) --->
() Additional binutils options
*** GCC Options ***
GCC compiler Version (gcc 4.7.x) --->
() Additional gcc options
[ ] Enable C++ support
[ ] Enable compiler OpenMP support
[ ] Enable libmudflap support
[ ] Build cross gdb for the host
[ ] Purge unwanted locales
() Generate locale data
(-pipe) Target Optimizations
() Target linker options
[ ] Register toolchain within Eclipse Buildroot plug-in
The WCHAR support is explained as:
Enable this option if you want your toolchain to support
wide characters (i.e characters longer than 8 bits, for
locale support)
Locale support needs WCHAR, but WCHAR should not in turn need
locale support.
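Since Buildroot exposes only a few of these switches, it can help
to look at what actually ended up in the generated config files.
A small sketch, assuming the configs are plain text files with
one option per line; show_i18n_options is a made-up helper name
and the paths in the usage comment are placeholders.

```shell
#!/bin/sh
# Print every line that mentions unicode, wchar or locale in the
# given config files, including "# ... is not set" lines.
show_i18n_options() {
    for cfg in "$@"; do
        if [ ! -r "$cfg" ]; then
            echo "cannot read $cfg" >&2
            continue
        fi
        echo "== $cfg"
        grep -i -E 'unicode|wchar|locale' "$cfg" ||
            echo "(no matching options)"
    done
}

# Usage (placeholder paths):
#   show_i18n_options path/to/busybox/.config path/to/uclibc/.config
```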
Here is my test-unicode script. It assumes busybox was installed
in /live/bin.
test-unicode:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/live/bin/sh
PATH=/live/bin
#[ "$LANG" = "en_US.UTF-8" ] || LANG=en_US.UTF-8 exec "$0" "$@"
x="Καλώς ήρθατε στο"
TR='\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217'
TR=$TR'\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237'
TR=$TR'\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257'
TR=$TR'\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277'
str_len() {
    printf "%15s: " "before export"
    echo -n "$*" | sed 's/./x/g' | wc -c
    export LANG=en_US.UTF-8
    printf "%15s: " "after export"
    echo -n "$*" | sed 's/./x/g' | wc -c
    printf "%15s: " "using tr hack"
    echo -n "$*" | tr -d "$TR" | wc -c
}
str_len "$x"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My bb-chroot code just bind mounts /dev, /sys, and /proc under /live
and then creates a symlink:
ln -s . /live/live
so I don't have to change the shebang or the PATH in the test
script. I do the chroot:
[linux32] chroot /live /bin/sh
When the shell exits, the script unmounts the three bind mounts.
bb-chroot:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/bin/sh
dir=${1:-/live}
# Quote $dir so a path containing spaces does not break the script.
mkdir -p "$dir/sys" "$dir/proc" "$dir/dev"
ln -sf . "$dir/live" 2>/dev/null
mountpoint -q "$dir/sys" || mount --bind /sys "$dir/sys"
mountpoint -q "$dir/proc" || mount --bind /proc "$dir/proc"
mountpoint -q "$dir/dev" || mount --bind /dev "$dir/dev"
chroot "$dir" /bin/sh
umount "$dir/dev"
umount "$dir/sys"
umount "$dir/proc"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Peace, James