Fixing unicode detection

Rich Felker dalias at aerifal.cx
Sun Jun 30 01:01:38 UTC 2013


I just submitted a bug report
(https://bugs.busybox.net/show_bug.cgi?id=6356) and a proposed partial
fix for busybox's unicode detection. To elaborate on the issue, UTF-8
support will not be enabled unless the LANG environment variable
contains the name of a locale that's UTF-8-based; the rest of the
standard locale logic based on the LC_* variables is overridden. For
example, if you leave LANG unset and set only LC_CTYPE or LC_ALL to a
UTF-8 locale, busybox will ignore them and use the "C" locale.

I've never used the LANG variable, and the only reason I did not
notice the issue sooner was that, on musl, the C locale is UTF-8
based, so busybox's passing "C" to setlocale does not turn off UTF-8
support. However, there's a movement going through the Austin Group
tracker to force the C locale to be 8-bit, and if that goes through
and musl follows the standard, this issue in busybox will affect me.
It presently affects any UTF-8 users who set the LC_* variables but not
the LANG variable.

In the bug report, I noted that the only way to ensure the standard
locale semantics apply is to pass "" to setlocale, but this cannot
easily facilitate dynamic locale changes in shells. One possible
solution that will give _approximately_ correct, but not entirely
correct on all implementations, semantics is the following:

char *loc;
(loc = getenv("LC_ALL")) ||
(loc = getenv("LC_CTYPE")) ||
(loc = getenv("LANG")) ||
(loc = "");
setlocale(LC_CTYPE, loc);

For supporting dynamic locale change in shells, getenv would be
replaced by the equivalent lookup using shell variables rather than
the environment.
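
As a rough sketch of that substitution, the lookup can be passed in as
a function pointer so the same logic serves both getenv and a shell's
own variable table. The names var_lookup_fn, pick_ctype_locale,
env_lookup and apply_ctype_locale are hypothetical and do not exist in
busybox. Unlike the snippet above, this version also skips variables
that are set to the empty string, since POSIX treats those as unset.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback: returns the variable's value, or NULL if unset. */
typedef const char *(*var_lookup_fn)(const char *name);

/* Pick the locale name that should govern LC_CTYPE, in POSIX priority
 * order: LC_ALL, then LC_CTYPE, then LANG. */
const char *pick_ctype_locale(var_lookup_fn lookup)
{
    static const char *const names[] = { "LC_ALL", "LC_CTYPE", "LANG" };

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        const char *val = lookup(names[i]);
        /* A variable set to the empty string counts as unset. */
        if (val && *val)
            return val;
    }
    /* Nothing set: let setlocale choose the implementation default. */
    return "";
}

/* Lookup backed by the process environment. */
const char *env_lookup(const char *name)
{
    return getenv(name);
}

/* Returns nonzero if setlocale accepted the chosen name. */
int apply_ctype_locale(var_lookup_fn lookup)
{
    return setlocale(LC_CTYPE, pick_ctype_locale(lookup)) != NULL;
}
```

A shell would call apply_ctype_locale with its own lookup function
after any assignment to LANG or an LC_* variable; other applets could
pass env_lookup.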

Note that the final fallback to "" rather than "C" is important. Per
POSIX, the behavior of setlocale when "" is passed, and the behavior
of all the standard utilities, is to first check the environment
variables and, if none of them are set, to fall back to an
implementation-defined default locale. While the initial locale before
setlocale is called must be the C locale, the implementation-defined
default for "" can be anything. On musl, this default is always a
UTF-8 locale and will remain so even if we're forced to change the
plain C locale to be 8-bit. musl users in general will therefore not
be setting LANG or any of the LC_* vars but will be expecting UTF-8 to
work, and will expect a non-UTF-8 C locale only if LANG or the LC_*
vars are explicitly set to C.
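
One way to see what "" resolves to on a given system is to ask for the
codeset after calling setlocale. This is only a sketch: the function
name env_locale_is_utf8 is made up for illustration, and the comparison
assumes the implementation spells the codeset exactly "UTF-8", which is
not guaranteed everywhere.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Returns 1 if the LC_CTYPE locale selected by the environment (or by
 * the implementation default when nothing is set) reports a UTF-8
 * codeset, 0 otherwise. */
int env_locale_is_utf8(void)
{
    /* If setlocale fails, the locale is left unchanged, so the check
     * below then reports on whatever locale was already in effect. */
    setlocale(LC_CTYPE, "");

    /* Assumption: the codeset name is spelled "UTF-8". */
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```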

Note that one issue with passing "" to setlocale from shells is that,
if the variables are unset in the shell but still in the environment,
setlocale will act on them. The only way I know to inhibit this would
be for the shells to remove LANG and LC_* from their own environments.
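
A minimal sketch of that cleanup, assuming the shell has already
copied these variables into its own variable table and will export
them to child processes from there. The function name scrub_locale_env
is hypothetical, and the list covers only LANG and the POSIX LC_*
categories; some implementations define additional ones.

```c
#include <stdlib.h>

/* Remove the locale variables from this process's own environment so
 * that a later setlocale(category, "") cannot see stale values. */
void scrub_locale_env(void)
{
    static const char *const names[] = {
        "LANG", "LC_ALL", "LC_COLLATE", "LC_CTYPE",
        "LC_MESSAGES", "LC_MONETARY", "LC_NUMERIC", "LC_TIME"
    };

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        unsetenv(names[i]);
}
```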

If there's willingness to move forward with these changes, I can
possibly prepare a patch.

Rich

