'case' UTF-8 bug

Denys Vlasenko vda.linux at googlemail.com
Wed Jul 5 17:15:02 UTC 2017


On Wed, Jul 5, 2017 at 6:15 PM, Denys Vlasenko <vda.linux at googlemail.com> wrote:
> I reproduced it on another machine, with this libc:
>
> $ /lib/libc-2.22.so
> GNU C Library (Gentoo 2.22-r4 p13) stable release version 2.22, by
> Roland McGrath et al.
>
> The cause: ash uses chars 0x81...0x88 for special purposes.
> "π" is encoded as "cf 80" in UTF-8
> "ρ" is encoded as "cf 81" in UTF-8
> ash does have some code which handles 81 et al in user strings. Specifically,
> these two one-symbol strings are internally represented differently:
>
> "π" = CTLQUOTEMARK cf 80 CTLQUOTEMARK
> "ρ" = CTLQUOTEMARK cf CTLESC 81 CTLQUOTEMARK
>
> CTLESC is meant to prevent 81 from being misinterpreted.
>
> The bug is: when these strings are prepared for fnmatch(),
> CTLESC is not removed, but converted to \.
> This is because CTLESC is also used for quoting * and ?, and these _do_
> need escaping as \* and \? so that fnmatch() does not interpret them as
> globbing patterns.
>
> Thus, ash ends up calling fnmatch('cf \ 81', 'cf 81', 0).
> This normally works - superfluous backslash-escapes are simply ignored,
> and this returns a match.
>
> I guess what happens is that in a UTF-8 locale, some versions of glibc
> do not allow a backslash-escape _inside_ a multibyte character.
> It probably freaks out on seeing an invalid UTF-8 sequence.

The fix is in git; please try it and let me know how it works.

