[BusyBox] Re: Where's the bzip2 compressed linux-kernel patch?

Jörn Engel joern at wohnheim.fh-wedel.de
Wed Oct 22 12:48:23 UTC 2003


On Wed, 22 October 2003 06:13:50 -0500, Rob Landley wrote:
> On Tuesday 21 October 2003 08:15, Jörn Engel wrote:
> 
> The sort should more or less be pluggable, so what I may do is just abstract 
> it out into a sort() method, feed that to the library implementation of 
> quicksort for now, and come back after I've got the rest of it converted to C 
> and see what speedups are worth the increase in executable size.

Makes sense.  Complicated stuff with a simple interface - except that
Julian passed all sorts of unnecessary mess to the sort functions as
well.
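
Something like this would do as the seam, I guess (just a sketch; the
names and the naive rotation comparator are mine, not Julian's):

#include <stdlib.h>

/* Hypothetical pluggable interface: sort the index array "ptr" of
 * rotations of "block".  Anything with this signature can be dropped
 * in later. */
typedef void (*block_sort_fn)(unsigned int *ptr,
                              const unsigned char *block, int nblock);

/* Naive comparator over rotation start offsets - O(n) per comparison
 * in the worst case, and not reentrant.  Placeholder only. */
static const unsigned char *cmp_block;
static int cmp_nblock;

static int rotation_cmp(const void *a, const void *b)
{
        unsigned int i = *(const unsigned int *)a;
        unsigned int j = *(const unsigned int *)b;
        int k, d;

        for (k = 0; k < cmp_nblock; k++) {
                d = cmp_block[(i + k) % cmp_nblock]
                  - cmp_block[(j + k) % cmp_nblock];
                if (d)
                        return d;
        }
        return 0;
}

static void qsort_block_sort(unsigned int *ptr,
                             const unsigned char *block, int nblock)
{
        cmp_block = block;
        cmp_nblock = nblock;
        qsort(ptr, nblock, sizeof(*ptr), rotation_cmp);
}

/* current choice; swap in something smarter later */
static block_sort_fn sorter = qsort_block_sort;

Dropping in a better sort is then just a matter of changing the
function pointer.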

> I still think laptops are evolving to replace the big monster servers.  Hot 
> pluggable hardware?  Sure: pcmcia and USB.  Uninterruptible power supply?  
> Try hot-swappable batteries with hours of usage time and battery monitoring 
> well-integrated into the rest of the system.  Journaling filesystems?  But of 
> course.  And of course the laptop people noticed that power consumption (and 
> the resulting heat) was important before the clustering people figured out 
> what the limiting factor on cluster density really was.  (Your 
> air-conditioning bill.)
> 
> And these days, what is a "blade" except rackmounted laptop components 
> replacing the 1U?

Two years ago, someone commented on the supercomputer export
limitations that a cluster of laptops would make a mockery of the
very idea.  Back then I laughed, but blade servers are just that,
except for the premium price and vendor lock-in. :)

> > 1. Random data is trivial to sort.  Any decent algorithm is O(n*ln(n))
> > 2. "Normal" data is close enough to random data.  O(n*ln(n))
> > 3. Degenerate data causes most algorithms to approach O(n^2).
> 
> I did have an algorithms class in college. :)

I didn't; I never understood how such a class could be missing. :(

> > 3.1. "aaaaaaaaaaaaaaaaaaaa..." - all characters identical - simple to
> >      catch, not a real problem.
> > 3.2. "aaaaaaaaaaaaaaaaaaaa...aaab" - a single character different.
> >      This is much harder to catch and optimize.  It is also a very
> >      real problem, as many binaries have large ranges of 0x00, so this
> >      needs to be fast.
> > 3.3. "abcdefghijklmnopq...abcdefghijklmnopq..." - harmless string
> >      concatenated to itself.  Also causes O(n^2) behaviour, but needs
> >      a completely different solution.
> > 3.4. "a" -> "aba" -> "abacaba" -> "abacabadabacaba" - nasty
> >      combination of 3.2 and 3.3.  Special treatment for either one
> >      still runs into the other worst case here.  Hard.
> 
> Looking at the block sorting machinery, I've largely been trying to see if I 
> could throw any of it away without losing too much speed, or at least 
> refactor it to make the fallback order a bit more readily apparent.  I don't 
> expect to be able to fundamentally improve on the sorting theory end of 
> things without spending several months in the library, which I haven't got 
> time for just now.  However, the decompress code was so complicated and 
> crufty that it hid a good dozen major thinkos, which allowed some serious 
> simplification after it was all untangled.  There might be similar 
> opportunities on the compression side without actually changing the theory on 
> which it operates.

Very likely.  The code still contains debugging output, yet it hasn't
been changed for quite a while, having reached bugfix-only mode.
Something doesn't add up, imo. :)
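
Btw, if you want test inputs while untangling the compress side, the
degenerate cases above are trivial to generate.  A quick sketch of
3.1 through 3.4 (my code, not from bzip2):

#include <string.h>

/* 3.1: all characters identical */
static void fill_3_1(unsigned char *buf, int size)
{
        memset(buf, 'a', size);
}

/* 3.2: a single character different */
static void fill_3_2(unsigned char *buf, int size)
{
        memset(buf, 'a', size);
        buf[size - 1] = 'b';
}

/* 3.3: a harmless string concatenated to itself */
static void fill_3_3(unsigned char *buf, int size)
{
        int i;

        for (i = 0; i < size; i++)
                buf[i] = 'a' + (i % 16);
}

/* 3.4: the "abacabad..." ruler sequence - position n (1-based) gets
 * the letter named by the number of trailing zero bits of n */
static void fill_3_4(unsigned char *buf, int size)
{
        int i, j;

        for (i = 0; i < size; i++) {
                for (j = 0; !((i + 1) >> j & 1); j++)
                        ;
                buf[i] = 'a' + j;
        }
}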

> One question I'm still rooting around for an answer to is if the first batch  
> of run length compression (the one undone after the burrows-wheeler 
> detransform during decompress) is done before the sorting in order to 
> mitigate 3.1 and 3.2 from your list.  I think so, but I haven't found it in 
> the compress code yet.  (I will, I was just busy on the decompression side 
> this evening, and have had about my fill of bzip for one day already.  I'll 
> pick it up again tomorrow.)

I owe you one!  For some reason, I never noticed the run length
encoding and kept wondering how Julian's implementation could be so
quick.  All the papers on efficient block sorting claimed things like
O(n*ln(n)) for *any* data, were complicated beyond any sense, and I
couldn't figure out how this could even be *possible*.

Now I know that it really was just bullshit (enough for a PhD thesis
or two :) and Julian's speed advantage is just the RLE.  In other
words, I have to convince my employer to open source my sorting
method, and you can very likely throw away the fallback algorithm. :)
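
For reference, the step in question replaces runs of 4 to 255
identical bytes with the first four bytes plus a count of the extra
repeats - so 3.1 and 3.2 collapse before the sorter ever sees them.
A sketch from memory of the format, not Julian's actual code (the
exact cap on the count byte is worth double-checking):

#include <stddef.h>

/* Pre-sort RLE: a run of 4 or more identical bytes becomes four
 * literals plus one byte holding the number of extra repeats.  The
 * caller must give "out" some slack - a run of exactly 4 grows by
 * one byte.  Returns the encoded length. */
static size_t rle1_encode(const unsigned char *in, size_t inlen,
                          unsigned char *out)
{
        size_t i = 0, o = 0;

        while (i < inlen) {
                unsigned char c = in[i];
                size_t run = 1;

                /* cap the run so the repeat count fits one byte */
                while (i + run < inlen && in[i + run] == c && run < 255)
                        run++;

                if (run < 4) {
                        size_t k;

                        for (k = 0; k < run; k++)
                                out[o++] = c;
                } else {
                        out[o++] = c;
                        out[o++] = c;
                        out[o++] = c;
                        out[o++] = c;
                        out[o++] = (unsigned char)(run - 4);
                }
                i += run;
        }
        return o;
}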

> > There may be more, but those are the types I have found.  And they
> > really matter in the real world.  I've been hit hard by the swen worm,
> > receiving some 100 mails per day, each having identical or similar
> > attachments.  Trying to compress these with bzip2 takes ages, while
> > gzip is lightning fast.  This is one of the reasons for me to design a
> > new compression algorithm that improves on bzip2. :)
> 
> I'm all for it, but if code isn't available I'm not going to pay any attention 
> to it.

This will take time, more than a year, likely.  The design is still
evolving, and I have to figure out if I can squeeze some academic
title out of this.  As long as people get them for obvious bullshit
like the above, why not. :)

> I must admit to being curious whether that closed-source UPX thing people were 
> talking about is removing unnecessary ELF segments from the executable it's 
> compressing.  That's one really easy way to make yourself look good, strip 
> debugging info... :)

It is not lossless anymore, though.  I can imagine that they simply
pre-generate an LZ77 or LZ78 dictionary of common strings.  The LZ
algorithms only start compressing at the second occurrence of a
string; with a pre-defined dictionary, you can already compress the
first.

The drawback is a larger compressor/decompressor executable, but that
is just one binary, compared to many smaller payload files.
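
The idea is easy to play with, since zlib has exactly this hook.  A
sketch (the dictionary contents are made up, and I have no idea what
UPX actually does):

#include <string.h>
#include <zlib.h>

/* Strings we expect in the payload; matches against these can be
 * emitted from the very first input byte onwards. */
static const unsigned char dict[] =
        "/lib/ld-linux.so.2" "libc.so.6" "__libc_start_main";

static int compress_with_dict(const unsigned char *in, unsigned inlen,
                              unsigned char *out, unsigned outlen,
                              unsigned long *produced)
{
        z_stream z;
        int ret;

        memset(&z, 0, sizeof(z));
        if (deflateInit(&z, Z_BEST_COMPRESSION) != Z_OK)
                return -1;
        /* must happen before the first deflate() call */
        if (deflateSetDictionary(&z, dict, sizeof(dict) - 1) != Z_OK) {
                deflateEnd(&z);
                return -1;
        }

        z.next_in = (unsigned char *)in;
        z.avail_in = inlen;
        z.next_out = out;
        z.avail_out = outlen;
        ret = deflate(&z, Z_FINISH);
        *produced = z.total_out;
        deflateEnd(&z);
        return ret == Z_STREAM_END ? 0 : -1;
}

The decompress side mirrors it: inflate() stops with Z_NEED_DICT, and
you answer with inflateSetDictionary() before continuing.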

> > To me, the workspace problem is more interesting.  Kernel zlib doesn't
> > malloc anymore, bzlib should receive the same treatment.  And after
> > that, how do we minimize the memory impact?  7MiB per bzlib stream is
> > not for the faint of heart.
> 
> Well, my implementation has it down to 4 megabytes per stream.  (3.6 megs for 
> dbuf, plus a couple hundred k for sundries in bzip_data.)  I vaguely remember 
> trimming an amazing amount of crap out of the data structures, including some 
> humongous allocations that were just bonkers.

But that only works for the inflate side.  For compression, you need
7.2M, plus the original buffer, plus some overhead.  More like 9M,
actually.
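
The numbers follow from the documented requirements: compression
takes roughly 400k + 8 x blocksize (7.6M at -9), decompression 100k +
4 x blocksize.  As for the no-malloc treatment, the kernel zlib
calling convention should carry over directly - the caller hands in
one big workspace up front.  A sketch (the zlib names are the real
kernel API; a bzlib equivalent would just need a ~4M workspace
instead):

#include <linux/errno.h>
#include <linux/vmalloc.h>
#include <linux/zlib.h>

/* Caller-provided workspace instead of malloc: size it once via
 * zlib_inflate_workspacesize(), then the stream never allocates. */
static int setup_inflate_stream(struct z_stream_s *strm)
{
        strm->workspace = vmalloc(zlib_inflate_workspacesize());
        if (!strm->workspace)
                return -ENOMEM;
        return zlib_inflateInit(strm) == Z_OK ? 0 : -EINVAL;
}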

> That said, you're not getting it below 1 megabyte per stream no matter what 
> you do, because the intermediate buffer has to store 900k before the 
> burrows-wheeler transform can be undone, and you CAN'T decompress until you 
> have a complete block.

Right.  You surely don't want too many of those running in parallel.
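
To put a number on it: ten concurrent streams at ~4M each is already
40M of decompression state sitting around.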

> This is not designed to be a streaming protocol.  (The latency on output is 
> similarly horked: you try flushing the output after 4k of data, and your 
> compression ratio is going to be hamstrung.)  You're talking about an 
> application for this that does not interest me.

Also agreed.  Sometimes it shows that six months ago, I was still
completely ignorant of compression.  On the other hand, the reasons
to put multiple compression algorithms into the kernel are growing
with time.  Swap compression will make sense someday, as one example.

> You're talking about a problem I have no interest in.  Feel free to adapt 
> Manuel and my bunzip cleanup if you think it'll help, it's lgpl, but I'm just 
> not going there.  Bzip is the best way to distribute tarballs at the moment, 
> possibly good for compressing kernel images, etc.  It's not a replacement for 
> mp3 or mp4, either. :)

Or UPX, to name a (hopefully) lossless algorithm.  Compression always
makes assumptions about the data, but some assumptions are more
general than others.  Gzip and bzip2 are the most generalized in
actual use, afaik.  I'll try to add a third, but I may well fail.
Your (and Manuel's) work still helps me; working with bzip is a pain,
and gzip is even worse.

Speaking of which, have you already done something similar for gzip?

Jörn

-- 
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein


