How do I (unconditionally) enable unicode support in busybox?

James Bowlin bitjam at gmail.com
Mon Aug 11 01:46:41 UTC 2014


On Sun, Aug 10, 2014 at 11:12 PM, Harald Becker said:
> How shall your final BB manage UTF if you disable locale stuff in the 
> lib. 

I created a tr hack that gives the length of a unicode string even if
there is no unicode support.  All I need is to get the length of
unicode strings in characters, not bytes.  Here is the hack:

TR='\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217'
TR=$TR'\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237'
TR=$TR'\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257'
TR=$TR'\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277'

str_len() {
    echo -n "$*" | tr -d "$TR" | wc -c
}

This works whenever the input is valid UTF-8, with or without
unicode support in busybox.  If you look at the table here:
https://en.wikipedia.org/wiki/Utf-8#Description

you will see that my tr expression removes the continuation bytes
(byte-2 through byte-6 of each multi-byte sequence), which lets
"wc -c" count characters instead of bytes.  The $TR variable
contains all 64 byte values that start with binary "10" (0x80
through 0xBF).  The numbers in $TR are in octal.
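
As a rough check of the counting argument, here is a minimal
sketch.  The names byte_len and utf8_len, the range form
'\200-\277' and the LC_ALL=C prefix are assumptions made for
illustration; the range is meant to name the same 64 bytes that
$TR lists one by one, and it assumes a tr that accepts ranges of
octal escapes.

```shell
#!/bin/sh
# Sketch only: count UTF-8 characters by deleting continuation bytes.
# Assumption: tr accepts a range of octal escapes, so '\200-\277'
# names the same 64 bytes (binary 10xxxxxx) as the explicit list.

byte_len() {
    # Number of bytes in the arguments; $(( )) strips wc's padding.
    echo $(( $(printf '%s' "$*" | wc -c) ))
}

utf8_len() {
    # Only ASCII and lead bytes survive tr, so wc -c counts characters.
    echo $(( $(printf '%s' "$*" | LC_ALL=C tr -d '\200-\277' | wc -c) ))
}

# Greek sample text, written as octal escapes to keep this file ASCII.
greek=$(printf '\316\232\316\261\316\273\317\216\317\202 \316\256\317\201\316\270\316\261\317\204\316\265 \317\203\317\204\316\277')

echo "bytes: $(byte_len "$greek")  characters: $(utf8_len "$greek")"
```

For this sample the two counts are 30 and 16.  Note that this
counts code points, so a base letter followed by a combining
accent would count as two.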

The following usually does not work unless I am on the command
line or I use "lang=xxxx" as a boot parameter in the initrd:

    echo -n "$*" | sed 's/./x/g' | wc -c

No amount of exporting or exec'ing has helped.  The only way I
have been able to get this to work in the initrd is with the
"lang=xxxx" boot parameter.  I realize this may be difficult to
believe.  My test program merely tries various ways to get the
length of a unicode string, using a string in Greek for
testing.  Here is the string I've been using:

x="Καλώς ήρθατε στο"

It has 30 bytes and 16 unicode characters.  I don't know if it
will survive the email system.  I could give it to you in hex or
octal but I don't know how to convert that back into unicode.
Ah, xxd does the trick.  Set:

y="0000000: ce9a ceb1 cebb cf8e cf82 20ce aecf 81ce
0000010: b8ce b1cf 84ce b520 cf83 cf84 cebf"

and then run:

    echo "$y" | xxd -r

The embedded newline in $y is required.
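
If xxd is not available, the octal route can be sketched with od
and printf instead.  The function names to_octal and from_octal
are made up for this example, and it assumes an od that accepts
"-An -v -t o1" and a printf that expands \ooo escapes in its
format string.

```shell
#!/bin/sh
# Sketch only: round-trip a string through octal escapes without xxd.

to_octal() {
    # od prints one 3-digit octal number per byte.  Join the output
    # lines, drop trailing blanks, and turn each run of spaces into
    # a backslash, giving \316\232...
    printf '%s' "$1" | od -An -v -t o1 | tr -d '\n' | sed 's/ *$//; s/  */\\/g'
}

from_octal() {
    # printf expands \ooo escapes in its format string.  Only safe
    # for strings made purely of such escapes (no % characters).
    printf "$1"
}

esc=$(to_octal 'A b')
printf '%s\n' "$esc"
from_octal "$esc"; echo
```

The escaped form is plain ASCII, so it should survive email
unchanged.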

> Looks like you are building your BB the wrong way.

Could well be.  I just don't know what changes I can make to fix
it.  I am pursuing this mostly because it seems to be a problem
in BB (even if that problem is non-obvious config parameters
needed to get the unicode support I want to work reliably).  ISTM
there should be an easy and obvious way to unconditionally enable
unicode support.  It is only because I can get it to work
sometimes (even with no locale files) and because there seems to
be code for it in unicode.c that I have pursued this as far as I
have.  ISTM that at worst, exporting LANG=en_US.UTF-8 (or even
LANG=utf, according to unicode.c) should enable unicode support,
especially if it is combined with exec.  But this *never* works
in the chroot or in the initrd.  Using the boot parameter
"lang=en_US.UTF-8" makes it work in the initrd although this is
fragile.
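
To narrow down where the locale setting gets lost, it may help to
run the sed count under an explicitly chosen locale.  This is only
a sketch: the name sed_len is made up, and it sets LC_ALL rather
than LANG on the assumption that LC_ALL overrides the other locale
variables.

```shell
#!/bin/sh
# Sketch only: count what sed's "." matches under a given locale.
# Assumption: LC_ALL is set instead of LANG because it takes
# precedence over the other locale variables.

sed_len() {
    # $1 = locale name, remaining arguments = string to measure
    want=$1
    shift
    echo $(( $(printf '%s' "$*" | LC_ALL=$want sed 's/./x/g' | wc -c) ))
}

# Greek sample text, written as octal escapes to keep this file ASCII.
greek=$(printf '\316\232\316\261\316\273\317\216\317\202 \316\256\317\201\316\270\316\261\317\204\316\265 \317\203\317\204\316\277')

for try in C en_US.UTF-8; do
    printf '%15s: %s\n' "$try" "$(sed_len "$try" "$greek")"
done
```

A count of 30 means sed treated every byte as a character; 16
means it decoded UTF-8.  If both lines show 30, the UTF-8 locale
is not taking effect for sed in that environment.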

I am dubious about using a pre-built busybox because I am picky
about the config.  There are some less often used commands that I
need and there are many often used commands that I don't need and
don't want.

My system boots with the new BB I made with uclibc and it
contains the selection of commands I want.  I haven't tested every
feature but given all of my previous experience I don't expect
any problems.  I've been using this BB config for years with only
small variations to include new commands as I need them.  The
only problem I've had with it is enabling unicode support.

Buildroot exposes a very limited set of uclibc and binutils
options:

      *** uClibc Options ***
      uClibc C library Version (uClibc 0.9.32.x)  --->
   (package/uclibc/uClibc-0.9.32.config) uClibc configuration file to use?
  [*] Enable large file (files > 2 GB) support
  [ ] Enable IPv6 support
  [ ] Enable RPC support
  [*] Enable WCHAR support
  [ ] Enable toolchain locale/i18n support
      Thread library implementation (linuxthreads (stable/old))  --->
  [ ] Thread library debugging
  [ ] Enable stack protection support
  [ ] Compile and install uClibc utilities
  [ ] Compile and install uClibc tests
      *** Binutils Options ***
      Binutils Version (binutils 2.22)  --->
  ()  Additional binutils options
      *** GCC Options ***
      GCC compiler Version (gcc 4.7.x)  --->
  ()  Additional gcc options
  [ ] Enable C++ support
  [ ] Enable compiler OpenMP support
  [ ] Enable libmudflap support
  [ ] Build cross gdb for the host
  [ ] Purge unwanted locales
  ()  Generate locale data
  (-pipe) Target Optimizations
  ()  Target linker options
  [ ] Register toolchain within Eclipse Buildroot plug-in

The WCHAR support is explained as:

   Enable this option if you want your toolchain to support
   wide characters (i.e characters longer than 8 bits, for
   locale support)

Locale support needs WCHAR, but WCHAR should not in turn need
locale support.
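
To see what a given uClibc config actually has set, a check along
these lines might help.  It is a sketch under assumptions: the name
opt_state is made up, and UCLIBC_HAS_WCHAR and UCLIBC_HAS_LOCALE
are the option names I believe sit behind the two Buildroot menu
entries.  Buildroot may also adjust the config it builds with, so
the file used for the real build is the better one to inspect.

```shell
#!/bin/sh
# Sketch only: report whether an option is on in a kconfig-style file.
# Assumption: UCLIBC_HAS_WCHAR and UCLIBC_HAS_LOCALE are the uClibc
# names behind the WCHAR and locale/i18n menu entries.

opt_state() {
    # $1 = config file, $2 = option name; prints "on" or "off".
    # A missing file is reported as "off".
    if grep -q "^$2=y" "$1" 2>/dev/null; then
        echo on
    else
        echo off
    fi
}

cfg=${1:-package/uclibc/uClibc-0.9.32.config}
for opt in UCLIBC_HAS_WCHAR UCLIBC_HAS_LOCALE; do
    printf '%-20s %s\n' "$opt" "$(opt_state "$cfg" "$opt")"
done
```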

Here is my test-unicode script.  It assumes busybox was installed
in /live/bin.  

test-unicode:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/live/bin/sh

PATH=/live/bin

#[ "$LANG" = "en_US.UTF-8" ] || LANG=en_US.UTF-8 exec "$0" "$@"

x="Καλώς ήρθατε στο"

TR='\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217'
TR=$TR'\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237'
TR=$TR'\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257'
TR=$TR'\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277'

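str_len() {
    # 1) sed with whatever locale the script inherited
    printf "%15s: " "before export"
    echo -n "$*" | sed 's/./x/g'     | wc -c

    export LANG=en_US.UTF-8

    # 2) same sed count, now with LANG exported
    printf "%15s: " "after export"
    echo -n "$*" | sed 's/./x/g'     | wc -c

    # 3) locale-independent count: drop UTF-8 continuation bytes
    printf "%15s: " "using tr hack"
    echo -n "$*" | tr -d "$TR"       | wc -c
}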

str_len "$x"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

My bb-chroot code just bind mounts /dev, /sys, and /proc under /live
and then creates a symlink:

    ln -s . /live/live

so I don't have to change the shebang or the PATH in the test
script.  It then does the chroot:

    [linux32] chroot /live /bin/sh

When you exit the shell, the script unmounts the three bind mounts.

bb-chroot:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/bin/sh

dir=${1:-/live}

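mkdir -p "$dir/sys" "$dir/proc" "$dir/dev"

# Make /live resolve inside the chroot so the test script's paths work
ln -sf . "$dir/live" 2>/dev/null

# Bind mount each pseudo-filesystem unless it is already mounted
mountpoint -q "$dir/sys"  || mount --bind /sys  "$dir/sys"
mountpoint -q "$dir/proc" || mount --bind /proc "$dir/proc"
mountpoint -q "$dir/dev"  || mount --bind /dev  "$dir/dev"

chroot "$dir" /bin/sh

# Clean up once the chroot shell exits
umount "$dir/dev"
umount "$dir/sys"
umount "$dir/proc"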
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Peace, James

