Fixing unicode detection
Rich Felker
dalias at aerifal.cx
Mon Jul 1 03:24:16 UTC 2013
On Sun, Jun 30, 2013 at 01:28:59PM +0200, Denys Vlasenko wrote:
> On Sunday 30 June 2013 03:01, Rich Felker wrote:
> > I just submitted a bug report
> > (https://bugs.busybox.net/show_bug.cgi?id=6356) and a proposed partial
> > fix for busybox's unicode detection.
>
> You forgot to describe what the actual problem is...
>
> I am resorting to guessing here.
>
> You want "LC_ALL=en_US.UTF-8" to work, but it doesn't?
I want any combination of locale environment variables that would lead
to mbrtowc processing input as UTF-8 after a call to
setlocale(LC_CTYPE,"") to put busybox into "unicode mode" (UTF-8
handling). This is required from a conformance standpoint.
Aside from the obvious standard ways one could request (for example)
en_US.UTF-8 for the CTYPE category (using LANG, LC_CTYPE, or LC_ALL),
it's also possible (implementation-defined) that even after calling
setlocale(LC_CTYPE,"") with NO variables set, the ctype encoding is
UTF-8. Since this behavior is implementation-defined, you can't
emulate it by processing the variables; you really have to pass "" to
setlocale to get it. And POSIX does require all the standard utilities
to operate as if they called setlocale(LC_ALL,"").
> > To elaborate on the issue, UTF-8
> > support will not be enabled unless the LANG environment variable
> > contains the name of a locale that's UTF-8-based; the rest of the
> > standard locale logic based on the LC_* variables is overridden. For
> > example if you leave LANG unset and just set LC_CTYPE or LC_ALL to a
> > UTF-8 locale, busybox will ignore them and use the "C" locale.
> >
> > I've never used the LANG variable,
>
> I just looked what Fedora does and the only sign of Unicode
> in the environment is "LANG=en_US.UTF-8", no LC_* variables are set.
That's just one way to set it. See:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html
and:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08
under section 8.2 Internationalization Variables, which describes the
way the LANG and LC_* variables work.
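Purely to illustrate the precedence that section describes for the CTYPE category (LC_ALL first, then LC_CTYPE, then LANG, with empty values treated as unset), here is a small sketch. The function name ctype_locale_from_env is hypothetical. It cannot reproduce the implementation-defined default used when nothing is set, so it is not a substitute for setlocale(LC_CTYPE,"").

```c
#include <stdlib.h>

/* Hypothetical helper: returns the value of the first of LC_ALL,
 * LC_CTYPE, LANG that is set and non-empty, or NULL if none is.
 * NULL means "implementation-defined default", which only
 * setlocale(LC_CTYPE, "") itself can resolve. */
const char *ctype_locale_from_env(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *v = getenv(vars[i]);
        if (v && *v)            /* an empty value counts as unset */
            return v;
    }
    return NULL;
}
```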
> > char *loc;
> > (loc = getenv("LC_ALL")) ||
> > (loc = getenv("LC_CTYPE")) ||
> > (loc = getenv("LANG")) ||
> > (loc = "");
> > setlocale(LC_CTYPE, loc);
>
> I tend to not depend on localized ctype functions in busybox,
> since for the most important locale, UTF-8, they don't work anyway.
This code has nothing to do with the ctype functions. LC_CTYPE is the
locale category that determines the character encoding. That's all
it's being used for.
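To make that concrete, a short sketch showing that changing only LC_CTYPE changes the character encoding (as reported by nl_langinfo(CODESET) and MB_CUR_MAX) while leaving other categories such as LC_NUMERIC alone. The helper names are hypothetical, and the locale names a system accepts vary.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: switch only the LC_CTYPE category and report the
 * resulting codeset name; returns NULL if the locale is unavailable. */
const char *ctype_codeset(const char *locale_name)
{
    if (!setlocale(LC_CTYPE, locale_name))
        return NULL;
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: 1 if LC_NUMERIC is still the "C" locale, i.e. it
 * was not affected by any LC_CTYPE change. */
int numeric_is_c_locale(void)
{
    const char *cur = setlocale(LC_NUMERIC, NULL); /* query only */
    return cur && strcmp(cur, "C") == 0;
}
```

After ctype_codeset("C"), MB_CUR_MAX is 1 on common implementations; after a successful switch to a UTF-8 locale it is larger, and numeric_is_c_locale() is unaffected either way.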
> I open-code two-way conditionals: we are either in ASCII or in Unicode.
> This should cover ~99.99999% of all users.
I understand this, though I believe you mean "in an 8bit legacy
codepage or in UTF-8" (not ASCII). ASCII is 7bit and would have the
utilities all erroring out with EILSEQ on encountering a high byte.
:-)
I'm not asking for support for other character encodings, just for
correct detection of whether the user's configured locale is
UTF-8-based or not.
> Are you concerned that sometimes busybox doesn't detect that it's
> running in "Unicoded" environment,
Precisely. I'm sorry that I was not more clear in stating this.
> or do you want to support
> some other setup (non-C and non-Unicode? Mixed setup for different
> LC_* categories?)?
>
> > if the variables are unset in the shell but still in the environment,
>
> This never happens in shells AFAIK...
OK.
Rich