BB UTF-8 support evaluation/TODO list
Rich Felker
dalias at aerifal.cx
Sat Jul 15 22:57:49 PDT 2006
the following is a quick cursory list i've made of everywhere in
busybox that multibyte locale (utf-8) support is lacking. this isn't
intended as a bug report since bb does not claim support yet, but
rather as a reference for anyone looking for areas to contribute.
when i've asked about utf-8 in the past, especially about the issue of
size/bloat versus functionality, the answer i've received is that implementing
correct utf-8 support is ok as long as it's conditional on the locale
support configuration option. however, users of legacy 8bit locales
might appreciate it if we made a secondary locale option for full
multibyte locale support, so that they could continue to have their
8bit locales without increased bloat.
i've labelled each affected applet with a priority:
LOW - missing support is optional or involves only nonstandard features
MID - support is needed for SUSv3 compliance
HIGH - needed for SUSv3 and for basic everyday usability
*** coreutils ***
cat (LOW)
maybe something for the bloated options (-vetET); not important imo
cut (MID)
cutting by characters does not work.
multibyte delimiters don't work.
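a rough sketch of what cutting by characters involves, assuming the input is valid utf-8 (a locale-generic version would go through mbrtowc() instead). utf8_offset() is a hypothetical helper, not existing busybox code; it maps a character index to a byte offset so that a character range can be turned into a byte range:

```c
#include <stddef.h>

/* Byte offset of character number n (0-based) in the NUL-terminated
 * UTF-8 string s. If s has fewer than n characters, the offset of the
 * terminating NUL is returned. Assumes s is valid UTF-8. */
size_t utf8_offset(const char *s, size_t n)
{
    size_t i = 0;

    while (s[i] != '\0' && n > 0) {
        i++;                                    /* step over the lead byte */
        while (((unsigned char)s[i] & 0xC0) == 0x80)
            i++;                                /* and its continuation bytes */
        n--;
    }
    return i;
}
```

with s holding "a", then U+00E9 (two bytes), then "b", selecting characters 2-3 means copying from utf8_offset(s, 1) up to but not including utf8_offset(s, 3), i.e. bytes 1 through 3.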
fold (HIGH, if anyone uses it)
in order to correctly fold text, fold needs to know character widths,
not byte counts.
length (LOW)
perhaps length should be able to print character cell width or
character count in addition to byte count.
ls (HIGH)
column alignment is totally wrong without counting character cells.
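to illustrate the difference between byte counts and character cells (which matters for fold as well as ls), here is a minimal sketch. everything in it is an assumption for illustration: u8_decode(), cp_width() and display_width() are hypothetical names, the decoder does not reject overlong forms, and the width table is a crude stand-in for what wcwidth() would report in a real utf-8 locale.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s (which must not be at the NUL).
 * Returns the number of bytes consumed; a malformed sequence yields
 * U+FFFD and consumes one byte. Overlong forms are not rejected. */
static size_t u8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t need, i;

    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *cp = p[0] & 0x1F; need = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; need = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { *cp = p[0] & 0x07; need = 3; }
    else { *cp = 0xFFFD; return 1; }

    for (i = 1; i <= need; i++) {
        if ((p[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return need + 1;
}

/* Very rough cell width: 0 for combining diacritics, 2 for the main
 * east asian wide blocks, 1 otherwise. Illustrative only. */
static int cp_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;
    if ((cp >= 0x1100 && cp <= 0x115F) ||
        (cp >= 0x2E80 && cp <= 0xA4CF) ||
        (cp >= 0xAC00 && cp <= 0xD7A3) ||
        (cp >= 0xF900 && cp <= 0xFAFF) ||
        (cp >= 0xFF00 && cp <= 0xFF60) ||
        (cp >= 0xFFE0 && cp <= 0xFFE6))
        return 2;
    return 1;
}

/* Number of terminal cells the NUL-terminated string s occupies. */
size_t display_width(const char *s)
{
    size_t w = 0;
    uint32_t cp;

    while (*s != '\0') {
        s += u8_decode(s, &cp);
        w += (size_t)cp_width(cp);
    }
    return w;
}
```

a column layout would then pad each name with (column width - display_width(name)) spaces rather than (column width - strlen(name)).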
sort (MID)
probably working but unconfirmed.
tr (MID)
currently operates on bytes, not characters.
"correct" multibyte support seems ugly to implement and maybe even
undesirable..?
wc (MID)
wc is supposed to have an option to count characters instead of bytes.
*** editors ***
vi (HIGH, for vi freaks)
cursor positioning, etc. highly depend on knowing character and column
counts, not byte counts.
sed (MID)
the y/source/dest/ command needs to process multibyte characters.
all the regex stuff should already work as long as the system regex
lib works.
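for the y/source/dest/ case, one possible approach is to walk all three strings a character at a time instead of a byte at a time. the sketch below assumes valid utf-8 and compares raw byte sequences; y_translate() and u8_seq_len() are hypothetical names, not the existing sed code:

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of the UTF-8 character at s (s must not be at the
 * NUL): the lead byte plus up to three continuation bytes. */
static size_t u8_seq_len(const char *s)
{
    size_t n = 1;

    while (n < 4 && ((unsigned char)s[n] & 0xC0) == 0x80)
        n++;
    return n;
}

/* Replace each character of in that occurs in src with the character
 * at the same position in dst, writing a NUL-terminated result to out.
 * Returns the number of bytes written (not counting the NUL), or
 * (size_t)-1 if out is too small. */
size_t y_translate(const char *in, const char *src, const char *dst,
                   char *out, size_t outsz)
{
    size_t o = 0;

    if (outsz == 0)
        return (size_t)-1;
    while (*in != '\0') {
        size_t ilen = u8_seq_len(in);
        const char *rep = in;           /* default: copy unchanged */
        size_t rlen = ilen;
        const char *s = src, *d = dst;

        while (*s != '\0' && *d != '\0') {
            size_t slen = u8_seq_len(s), dlen = u8_seq_len(d);

            if (slen == ilen && memcmp(s, in, ilen) == 0) {
                rep = d;
                rlen = dlen;
                break;
            }
            s += slen;
            d += dlen;
        }
        if (o + rlen >= outsz)
            return (size_t)-1;
        memcpy(out + o, rep, rlen);
        o += rlen;
        in += ilen;
    }
    out[o] = '\0';
    return o;
}
```

note that the output can be longer or shorter than the input in bytes, since source and dest characters need not have the same encoded length. a real implementation would also want to check that source and dest contain the same number of characters.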
*** shells ***
all shells (HIGH)
command line editing is broken with multibyte chars.
arrow keys need to move by character, not byte.
backspace needs to delete characters.
need to be aware of nonspacing and east-asian-wide characters.
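a minimal sketch of character-wise cursor movement and backspace in an edit buffer, assuming the buffer holds valid utf-8. the function names are hypothetical and not taken from any busybox shell. this only handles the byte-versus-character part; treating a base character plus its nonspacing marks as one unit, and accounting for wide characters when redrawing, would need width information on top of it.

```c
#include <stddef.h>
#include <string.h>

/* True if c is a UTF-8 continuation byte (10xxxxxx). */
static int is_cont(char c)
{
    return ((unsigned char)c & 0xC0) == 0x80;
}

/* Left arrow: byte offset of the character that starts before pos. */
size_t utf8_prev(const char *buf, size_t pos)
{
    if (pos == 0)
        return 0;
    do {
        pos--;
    } while (pos > 0 && is_cont(buf[pos]));
    return pos;
}

/* Right arrow: byte offset just past the character starting at pos. */
size_t utf8_next(const char *buf, size_t len, size_t pos)
{
    if (pos >= len)
        return len;
    do {
        pos++;
    } while (pos < len && is_cont(buf[pos]));
    return pos;
}

/* Backspace: remove the whole character before pos (pos <= *len),
 * update *len, and return the new cursor offset. */
size_t utf8_backspace(char *buf, size_t *len, size_t pos)
{
    size_t start = utf8_prev(buf, pos);

    memmove(buf + start, buf + pos, *len - pos);
    *len -= pos - start;
    return start;
}
```

with the buffer holding "a", U+00E9, "b" (4 bytes) and the cursor at byte 3, a backspace removes both bytes of U+00E9 and leaves "ab" with the cursor at byte 1.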
i might have missed a few applets, but i think the above is an
almost-complete list.
rich