[PATCH] Compress common substrings in applet names to save space

Wed Apr 20 13:09:25 UTC 2016

On Wed, Apr 20, 2016 at 2:54 PM, Ron Yorston <rmy at pobox.com> wrote:
> Denys Vlasenko wrote:
>>How about this version?
>
> Neat.  The generated include file is much prettier and you managed to
> squeeze out a few more bytes.
>
> A couple of things:
>
>    Since you've dropped the adjustment to the number of offsets you can
>    remove 2*KNOWN_APPNAME_OFFSETS from the size calculation;
>
>    hex_char() should really be called octal_char();
>
>    and it has a whitespace issue.  (Sorry, that's three things.)
>
>>However, I also think that this idea is slightly too insane...
>
> Aww.  :-(
>
> Most of the complexity is at build-time, in the generation of the applet
> tables.  I'm actually quite pleased with how the run-time code turned out.

I just checked. About the same savings can be achieved a different,
less insane way.

- gzip applet_names[], store only gzipped stream.

- two users of applet_names[] in lineedit.c and ash.c unpack it when needed.

- find_applet_by_name(name) does not use applet_names[] at all.
  Instead, it calculates a hash of "name"
  and looks for it in uint16_t applet_hashes[]. I tried this hash:

static uint32_t hash(const char *s, uint32_t mul)
{
        uint32_t h = *s++;
        while (*s)
                h = (h * mul) ^ (uint8_t)*s++;
        return (uint16_t)h & 0xffff;
}

- applet_hashes[] is precalculated, MUL is chosen so that
  there are no collisions, every applet's hash is unique.
  My test code so far:

  applet_tables.c
  ========================
        uint32_t mul;
        mul = 1;
 collision:
        for (i = 0; i < NUM_APPLETS; i++) {
                int j;
                uint32_t h = hash(applets[i].name, mul);
                for (j = 0; j < i; j++) {
                        if (h == applets[j].app_hash) {
                                fprintf(stderr, "mul:%u '%s'~'%s'\n",
                                        mul, applets[j].name, applets[i].name);
                                mul += 2;
                                goto collision;
                        }
                }
                applets[i].app_hash = h;
        }
        for (i = 0; i < NUM_APPLETS; i++) {
                fprintf(stderr, "mul:%u '%s' h:0x%04x\n", mul,
applets[i].name, applets[i].app_hash);
        }

mul:1 'basename'~'beep'
mul:3 '[['~'df'
mul:5 'ar'~'dc'
mul:7 'uudecode'~'uuencode'
mul:9 'env'~'ifconfig'
mul:11 'mv'~'nc'
mul:13 'cp'~'dc'
^^^^^^^^^^^^^^^^ those MULs are not good, collisions are shown
mul:15 '[' h:0x005b   <======= MUL=15 works! here's the list...
mul:15 '[[' h:0x050e
mul:15 'acpid' h:0x0b1f
mul:15 'add-shell' h:0x2898
mul:15 'addgroup' h:0x602e
mul:15 'adduser' h:0x22b0
mul:15 'adjtimex' h:0xd9e0
mul:15 'ar' h:0x05dd
mul:15 'arp' h:0x5783
mul:15 'arping' h:0xc669
...

Upsides:
- it is smaller (~same savings as our monstrosity)
- it is very likely faster
- no bss growth

Downside: "busybox BOGUSNAME" has a small chance of actually working
as some applet, if hash(BOGUSNAME) is matching something.
This is not too bad, since people are mostly concerned when
"busybox GOODNAME" doesn't work, not the other way around.

To reduce that probability, find_applet_by_name(name) can be taught
to refuse names with known wrong chars (e.g. uppercase letters) or length
(appnames are all 2+ chars except "[").