Fixing unicode detection

Rich Felker dalias at aerifal.cx
Mon Jul 1 03:24:16 UTC 2013


On Sun, Jun 30, 2013 at 01:28:59PM +0200, Denys Vlasenko wrote:
> On Sunday 30 June 2013 03:01, Rich Felker wrote:
> > I just submitted a bug report
> > (https://bugs.busybox.net/show_bug.cgi?id=6356) and a proposed partial
> > fix for busybox's unicode detection.
> 
> You forgot to describe what the actual problem is...
> 
> I am resorting to guessing here.
> 
> You want "LC_ALL=en_US.UTF-8" to work, but it doesn't?

I want busybox to enter "unicode mode" (UTF-8 handling) for any
combination of locale environment variables that would cause mbrtowc
to process input as UTF-8 after a call to setlocale(LC_CTYPE,"").
This is required from a conformance standpoint.

Aside from the obvious standard ways one could request (for example)
en_US.UTF-8 for the CTYPE category (using LANG, LC_CTYPE, or LC_ALL),
it's also possible (implementation-defined) that even after calling
setlocale(LC_CTYPE,"") with NO variables set, the ctype encoding is
UTF-8. Since this behavior is implementation-defined, you can't
emulate it by processing the variables; you really have to pass "" to
setlocale to get it. And POSIX does require all the standard utilities
to operate as if they called setlocale(LC_ALL,"").
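
A minimal sketch of the detection described above, assuming the
platform provides POSIX nl_langinfo(CODESET). The helper names
codeset_is_utf8 and locale_wants_utf8 are hypothetical and are not
taken from busybox; the accepted spellings of the codeset name are
also an assumption.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>
#include <strings.h>

/* Return 1 if a codeset name denotes UTF-8, else 0.
 * Accepting both "UTF-8" and "UTF8" (any case) is an assumption;
 * implementations differ in how they spell the name. */
static int codeset_is_utf8(const char *codeset)
{
    return codeset != NULL &&
           (strcasecmp(codeset, "UTF-8") == 0 ||
            strcasecmp(codeset, "UTF8") == 0);
}

/* Hypothetical helper: let the C library resolve the locale from the
 * environment (including any implementation-defined default), then
 * ask which encoding LC_CTYPE ended up with. */
static int locale_wants_utf8(void)
{
    /* If setlocale fails, the locale stays "C" and the check below
     * reports whatever codeset that locale uses. */
    setlocale(LC_CTYPE, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

Because the decision is left to setlocale itself, this covers LANG,
LC_CTYPE, LC_ALL, and the no-variables-set case without parsing any
variable by hand.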

> > To elaborate on the issue, UTF-8 
> > support will not be enabled unless the LANG environment variable
> > contains the name of a locale that's UTF-8-based; the rest of the
> > standard locale logic based on the LC_* variables is overridden. For
> > example if you leave LANG unset and just set LC_CTYPE or LC_ALL to a
> > UTF-8 locale, busybox will ignore them and use the "C" locale.
> > 
> > I've never used the LANG variable,
> 
> I just looked what Fedora does and the only sign of Unicode
> in the environment is "LANG=en_US.UTF-8", no LC_* variables are set.

That's just one way to set it. See:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html

and:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08

under section 8.2 Internationalization Variables, which describes the
way the LANG and LC_* variables work.
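
For illustration, the precedence that section gives for the CTYPE
category can be sketched as below: LC_ALL first, then LC_CTYPE, then
LANG, with a variable that is set to the empty string treated as
unset. The function name ctype_locale_from_env is hypothetical. This
only approximates what setlocale(LC_CTYPE,"") does, since it cannot
know the implementation-defined default used when nothing is set.

```c
#include <stdlib.h>

/* Hypothetical helper: return the locale name the environment
 * requests for LC_CTYPE, or NULL if LC_ALL, LC_CTYPE and LANG are
 * all unset or empty (the implementation-defined default applies). */
static const char *ctype_locale_from_env(void)
{
    /* Highest precedence first. */
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *v = getenv(vars[i]);
        /* An empty value counts as "not set". */
        if (v != NULL && *v != '\0')
            return v;
    }
    return NULL;
}
```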

> > char *loc;
> > (loc = getenv("LC_ALL")) ||
> > (loc = getenv("LC_CTYPE")) ||
> > (loc = getenv("LANG")) ||
> > (loc = "");
> > setlocale(LC_CTYPE, loc);
> 
> I tend to not depend on localized ctype functions in busybox,
> since for the most important locale, UTF-8, they don't work anyway.

This code has nothing to do with the ctype functions. LC_CTYPE is the
locale category that determines the character encoding. That's all
it's being used for.
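
To make the role of LC_CTYPE concrete, here is a rough probe that
asks mbrtowc to decode a known 3-byte UTF-8 sequence under whatever
LC_CTYPE is currently in effect. The function name ctype_decodes_utf8
and the choice of U+20AC as the test character are assumptions made
for this sketch; it is a heuristic, not a proof that the encoding is
UTF-8.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical probe: return 1 if the current LC_CTYPE locale decodes
 * the bytes E2 82 AC (U+20AC in UTF-8) as one 3-byte character. */
static int ctype_decodes_utf8(void)
{
    static const char euro[] = "\xe2\x82\xac";
    mbstate_t st;
    wchar_t wc;

    /* Start from the initial conversion state. */
    memset(&st, 0, sizeof st);

    /* mbrtowc returns the number of bytes consumed, or (size_t)-1 /
     * (size_t)-2 for an invalid or incomplete sequence. */
    return mbrtowc(&wc, euro, 3, &st) == 3;
}
```

A caller would run setlocale(LC_CTYPE, "") first and then use the
result of the probe to pick between the UTF-8 and single-byte code
paths.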

> I open-code two-way conditionals: we are either in ASCII or in Unicode.
> This should cover ~99.99999% of all users.

I understand this, though I believe you mean "in an 8-bit legacy
codepage or in UTF-8" (not ASCII). ASCII is 7-bit and would have the
utilities all erroring out with EILSEQ on encountering a high byte.
:-)

I'm not asking for support for other character encodings, just for
correct detection of whether the user's configured locale is
UTF-8-based or not.

> Are you concerned that sometimes busybox doesn't detect that it's
> running in "Unicoded" environment,

Precisely. I'm sorry that I was not more clear in stating this.

> or do you want to support
> some other setup (non-C and non-Unicode? Mixed setup for different
> LC_* categories?)?
> 
> > if the variables are unset in the shell but still in the environment,
> 
> This never happens in shells AFAIK...

OK.

Rich

