[BusyBox] Re: Where's the bzip2 compressed linux-kernel patch?

Jörn Engel joern at wohnheim.fh-wedel.de
Thu Oct 23 12:45:20 UTC 2003


On Wed, 22 October 2003 17:01:39 -0500, Rob Landley wrote:
> 
> There's 8 gazillion asserts, which should probably be comments.  There are an 
> awful lot of tests that translate to "I don't trust the C compiler".  This 
> has all the hallmarks of a theoretical mathematician trying to defend himself 
> from the evils of an implementation ever having to run on a real machine that 
> could have a bit flipped by cosmic rays... :)

If the asserts were actually called assert and not reimplemented, I'd
like them.  And not calling de-facto assertions assert is outright
evil, yes.
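
To make this concrete, here is the kind of thing I mean -- a minimal
sketch (the helper and the bounds are made up, not taken from the code
in question) of spelling a de-facto assertion as assert(), so a single
-DNDEBUG compiles them all out:

#include <assert.h>
#include <stdlib.h>

/* Hypothetical allocator, just to show the point: the invariant is
 * stated with assert(), not hidden in an anonymous if/abort pair. */
static unsigned char *alloc_block(size_t n)
{
	assert(n > 0 && n <= 900000);	/* de-facto assertion, named as one */
	return malloc(n);
}

int main(void)
{
	free(alloc_block(1024));
	return 0;
}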

> > This will take time, more than a year, likely.  The design is still
> > evolving, and I have to figure out if I can squeeze some academic
> > title out of this.  As long as people get them for obvious bullshit
> > like the above, why not. :)
> 
> I look forward to seeing it then, but the bzip conversion should be done LONG 
> before then.  (By the end of the month, I hope.  Don't hold me to that.)

Actually, I just saw that Andrew Tridgell has already done a similar
thing as part of his PhD thesis.  He has the easy part finished, so
things could go faster.  We'll see.

>>[UPX]
> I thought you could use the open decompressor with the closed compressor.  
> (And has anybody tested that it IS lossless?  I thought all they tested was 
> that it runs.  ELF files often have a bit of cruft in them that doesn't get 
> loaded into memory.  Since the upx decompressor thing basically takes the 
> place of the executable loader, there's room for fudging it, isn't there?)
> 
> Of course since it IS closed, I have no idea what it's really doing.  This is 
> just a guess based on knowledge of a thousand ways to pull a fast one while 
> claiming to be brilliant.  (I have a black belt in cheap plastic solutions. 
> :)

We could always analyse the file format, but that is just too much
pain for me.  They did the work, and if they can make some money with
it, I'm all for it.  Gzip remains free. :)
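
That said, the lossless question is easy enough to settle without the
source.  A quick round-trip sketch, assuming upx is in the path and
./hello is some test binary (both names are placeholders):

#include <stdio.h>
#include <stdlib.h>

/* Byte-for-byte comparison of two files. */
static int same_file(const char *a, const char *b)
{
	FILE *fa = fopen(a, "rb");
	FILE *fb = fopen(b, "rb");
	int ca = 0, cb = -1;

	if (fa && fb) {
		do {
			ca = fgetc(fa);
			cb = fgetc(fb);
		} while (ca == cb && ca != EOF);
	}
	if (fa) fclose(fa);
	if (fb) fclose(fb);
	return ca == cb;
}

int main(void)
{
	system("cp hello hello.packed");
	system("upx -q hello.packed");		/* pack in place */
	system("upx -q -d hello.packed");	/* unpack in place */
	printf("lossless: %s\n",
	       same_file("hello", "hello.packed") ? "yes" : "no");
	return 0;
}

That compares the whole file, cruft included, so it would also catch
the executable-loader fudging Rob suspects.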

> > But that only works for the inflate side.  For compression, you need
> > 7.2, plus original buffer, plus some overhead.  More like 9M,
> > actually.
> 
> True.  (Actually, I'll take your word for it.  I'm still a few days away from 
> having any idea how much memory this stuff actually needs.  Real Life hath 
> intruded.)

Sorting is the big memory hit.  You need 8n for radix sort and its
variants, or 4n for qsort and bubble sort.  Qsort and bubble sort
alone would be far too slow, so 8n is the answer.
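
Back of the envelope, for the 900k block of bzip2 -9 -- which is also
where my ~9M figure for compression came from: 8n of sort indices plus
the n-byte input buffer plus overhead.  A sketch, assuming 32-bit
indices:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	size_t n = 900000;			/* bzip2 -9 block size */
	size_t radix = 2 * n * sizeof(uint32_t);	/* ~8n, two index arrays */
	size_t plain = n * sizeof(uint32_t);		/* ~4n, one suffix array */

	printf("radix variants: %zu bytes\n", radix);
	printf("qsort/bubble:   %zu bytes\n", plain);
	printf("compress total: ~%zu bytes plus overhead\n", radix + n);
	return 0;
}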

> > Right.  You surely don't want too many of those running in parallel.
> 
> Depends.  You could overlap emptying the intermediate buffer (dbuf) with 
> filling up the next intermediate buffer.  (The first is streaming, the second 
> is random.)  And if your SMP system has 4 megs of local L2 cache per 
> processor, you can parallelize the blocks to your heart's content...
> 
> This is for a monster server, though.  A linksys box wouldn't want to waste 
> the electricity and heat budget even for hyper-threading...

And even then, you don't want much more than 2x #CPU of those beasts
running in parallel.  Disk cache is a better use for memory.
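
The whole tuning policy fits in a few lines -- a sketch, assuming
glibc's _SC_NPROCESSORS_ONLN is available:

#include <unistd.h>
#include <stdio.h>

int main(void)
{
	long ncpu = sysconf(_SC_NPROCESSORS_ONLN);	/* glibc extension */
	long max_workers = 2 * (ncpu > 0 ? ncpu : 1);	/* cap at 2x #CPU */

	printf("cap at %ld parallel (de)compressors\n", max_workers);
	return 0;
}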

> > Also agreed.  Sometimes it shows that 6 month ago, I was still
> > completely unaware about compression.  On the other hand, the reasons
> > to put multiple compression algorithms into the kernel are growing
> > with time.  Swap compression will make sense someday, as one example.
> 
> I think it already does.  But gzip, not bzip, and you'd probably want to group 
> pages together into about 64k chunks.  (I forget how often the dictionary 
> gets flushed normally anyway.  My last run through the deflate code was 
> around 1997.  It's on my to-do list...)

With 64k 'pages', compression makes less sense again.  Disk bandwidth
is a small problem, latency a big one.  Bigger chunks pay the same
disk latency but take longer to decompress.  The compression ratio is
better, but what you want to save is time, not space.  This needs a
fair bit of measurement, for sure.
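
The model I'd start measuring against looks like this (all three rates
below are assumptions, not measurements, and it ignores that the
on-disk chunk shrinks after compression):

#include <stdio.h>

/* time(chunk) = seek latency + chunk/disk_bw + chunk/inflate_bw.
 * Latency is paid once per chunk either way, so bigger chunks
 * mostly add transfer and decompression time. */
static double swap_in_ms(double bytes)
{
	const double latency = 8e-3;	/* 8 ms seek+rotate, assumed */
	const double disk_bw = 40e6;	/* 40 MB/s streaming, assumed */
	const double inflate_bw = 60e6;	/* 60 MB/s gunzip, assumed */

	return 1000.0 * (latency + bytes / disk_bw + bytes / inflate_bw);
}

int main(void)
{
	printf("4k page:   %.2f ms\n", swap_in_ms(4096.0));
	printf("64k chunk: %.2f ms\n", swap_in_ms(65536.0));
	return 0;
}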

> > Speaking of it, have you done something similar for gzip already?
> 
> I ported the compression side to Java 1.0 in 1997.  A cleanup of gzip code in 
> busybox is on my to-do list after I get bzip compression done, but it'll be a 
> while.

Great!  That should make my life much easier again.

Jörn

-- 
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham


