Fixing unicode detection
dalias at aerifal.cx
Wed Jul 3 17:15:40 UTC 2013
On Tue, Jul 02, 2013 at 05:25:28PM +0200, Denys Vlasenko wrote:
> On Mon, Jul 1, 2013 at 5:24 AM, Rich Felker <dalias at aerifal.cx> wrote:
> > I want any combination of locale environment variables that would lead
> > to mbrtowc processing input as UTF-8 after a call to
> > setlocale(LC_CTYPE,"") to put busybox into "unicode mode" (UTF-8
> > handling). This is required from a conformance standpoint.
> I'm going to add a check for $LC_ALL.
> What are the chances that someone doesn't set $LANG, $LC_ALL,
> but does set $LC_CTYPE?
Extremely high. Setting only LC_CTYPE is the way you get UTF-8 without
the other aspects of locale (like case-insensitive collation order,
wrong decimal points, etc.) that most people hate.
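
For illustration, here is a minimal sketch of the precedence POSIX gives
these variables for the LC_CTYPE category: LC_ALL first, then LC_CTYPE,
then LANG, skipping any that are unset or empty. The helper name
effective_ctype_name is hypothetical and is not taken from busybox. It
only shows why a check limited to LANG and LC_ALL misses the
LC_CTYPE-only setup. It is not a substitute for calling setlocale.

```c
#include <stdlib.h>

/*
 * Hypothetical helper (not busybox code): return the value of the
 * environment variable that POSIX precedence selects for the LC_CTYPE
 * category, or NULL if none of the three is set to a non-empty string.
 */
static const char *effective_ctype_name(void)
{
    /* Highest priority first: LC_ALL overrides LC_CTYPE, which overrides LANG. */
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *v = getenv(vars[i]);
        if (v && *v)    /* unset and empty are treated alike */
            return v;
    }
    return NULL;        /* implementation-defined default applies */
}
```

With only LC_CTYPE=en_US.UTF-8 in the environment, this returns that
value even though LANG and LC_ALL are both unset.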
> > Aside from the obvious standard ways one could request (for example)
> > en_US.UTF-8 for the CTYPE category (using LANG, LC_CTYPE, or LC_ALL),
> > it's also possible (implementation-defined) that even after calling
> > setlocale(LC_CTYPE,"") with NO variables set, the ctype encoding is
> > UTF-8.
> > Since this behavior is implementation-defined, you can't
> > emulate it by processing the variables; you really have to pass "" to
> > setlocale to get it.
> Take a look at the code. At #ifs around that place:
> /* Homegrown Unicode support. It knows only C and Unicode locales. */
> I want to be able to conditionally *not use setlocale at all*
> (for one, I use uclibc configured w/o locale, for size reasons),
> and yet, I want Unicode to work.
> (To make that possible, I roll my own wcrtomb et al).
> Therefore, "how to call setlocale() correctly" is a nonsensical
> question in some busybox configs.
Agreed. My bug report is for configurations that use setlocale. Note
that with musl, using setlocale will result in a much smaller busybox
binary than duplicating the UTF-8 code in busybox would.
I really have no opinion on how the configuration option for bypassing
setlocale "should" work, since I don't use it.
> >> Are you concerned that sometimes busybox doesn't detect that it's
> >> running in "Unicoded" environment,
> > Precisely. I'm sorry that I was not more clear in stating this.
> Does addition of LC_ALL check make your broken case work?
No. It doesn't fix either of the cases I care about:

1. Only LC_CTYPE is set and all others are unset. This is the standard
   way to get UTF-8 but no other (possibly undesirable) locale features
   when using glibc or uClibc.

2. No variables at all are set. This is the case that will matter to
   musl users after the Austin Group interpretation for issue #663 goes
   through and we're stuck implementing an 8-bit C locale. After that,
   calling setlocale(LC_*,"") will give the desired UTF-8 behavior,
   whereas failing to call setlocale at all, or calling it with "C" as
   the second argument, would disable UTF-8.
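
Both cases can be covered by letting the C library decide and then
asking it what it decided. The sketch below is an illustration under
stated assumptions and is not busybox's actual code. The names
ctype_is_utf8 and init_unicode_mode are hypothetical. The probe, which
feeds the three-byte UTF-8 encoding of U+20AC to mbrtowc, is a heuristic
chosen here. Comparing nl_langinfo(CODESET) against "UTF-8" would be an
alternative.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: report whether mbrtowc currently decodes input
 * as UTF-8. Heuristic: the 3-byte sequence for U+20AC must be consumed
 * as exactly one character.
 */
static int ctype_is_utf8(void)
{
    static const char euro[] = "\xe2\x82\xac";  /* U+20AC in UTF-8 */
    mbstate_t st;
    wchar_t wc;

    memset(&st, 0, sizeof st);                  /* initial conversion state */
    return mbrtowc(&wc, euro, 3, &st) == 3;
}

/*
 * Hypothetical startup hook: pass "" so the implementation applies
 * LC_ALL, LC_CTYPE, LANG, or its own default when none is set, then
 * probe the result instead of parsing the variables by hand.
 */
static int init_unicode_mode(void)
{
    setlocale(LC_CTYPE, "");
    return ctype_is_utf8();
}
```

Because the decision is made by setlocale, the same code handles the
LC_CTYPE-only case and the no-variables case without knowing which one
applies.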