[BusyBox] Re: Where's the bzip2 compressed linux-kernel patch?

Jörn Engel joern at wohnheim.fh-wedel.de
Tue Oct 21 13:15:39 UTC 2003


A bit long, sorry about that.

On Mon, 20 October 2003 20:42:49 -0500, Rob Landley wrote:
> On Monday 20 October 2003 09:33, Jörn Engel wrote:
> > On Mon, 20 October 2003 04:41:57 -0500, Rob Landley wrote:
> 
> The true block sorting stuff is really compression side only.  The 
> decompression already has all this sorted.  Undoing the burrows-wheeler 
> transform does require a kind of sort, but the end result isn't a sorted 
> array but a linked list showing what order the sorted array would have been 
> in, and they kind of fake their way around it with a one-pass thing that's 
> actually fairly neat.  (That's the "fast" algorithm.  The slow one I haven't 
> thought through yet.)

Yes, that is the suffix array, just in the unsorted form.  :)
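
If it helps, here is a minimal sketch of that one-pass trick (my own
illustration, not the busybox code): build a linked list lf[] that maps
each row to the row of the rotation shifted right by one, then walk it
once from the primary row to recover the text back to front, without
ever materializing the sorted array.

#include <stdlib.h>

void inverse_bwt(const unsigned char *bwt, unsigned char *out,
                 int n, int primary)
{
    int count[256] = {0}, base[256], i, c, p, k;
    int *lf = malloc(n * sizeof(int));

    for (i = 0; i < n; i++) {          /* lf[i] = earlier occurrences of bwt[i] */
        lf[i] = count[bwt[i]];
        count[bwt[i]]++;
    }
    for (c = 0, i = 0; c < 256; c++) { /* base[c] = characters smaller than c */
        base[c] = i;
        i += count[c];
    }
    for (i = 0; i < n; i++)
        lf[i] += base[bwt[i]];

    for (p = primary, k = n - 1; k >= 0; k--) {  /* single pass over the list */
        out[k] = bwt[p];
        p = lf[p];
    }
    free(lf);
}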

> The sorting side has FOUR DIFFERENT sorting algorithms, done entirely for 
> speed: it tries each one and falls back to the next after a certain work 
> threshold.  I'm shying away from that and cleaning up other bits at the 
> moment, trying to figure out what "cleaned up" means in this context...

Well, just two, depending on how you count: a fast algorithm that
sucks for "repetitive" data and counts the work it has done.  If it
has to do more work than a given "budget", it throws everything away
and starts over with the slow algorithm.
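
The control flow is just this (a toy demonstration of the budget idea,
not Julian's code - an insertion sort with a work counter, and plain
libc qsort() standing in for the slow algorithm):

#include <stdlib.h>

static int int_cmp(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Fast-but-fragile sort that counts its element moves and gives up
 * once the budget is spent. */
static int insertion_sort_budget(int *a, int n, long budget)
{
    int i, j, v;

    for (i = 1; i < n; i++) {
        v = a[i];
        for (j = i; j > 0 && a[j - 1] > v; j--) {
            a[j] = a[j - 1];
            if (--budget < 0)
                return -1;              /* too much work for this input */
        }
        a[j] = v;
    }
    return 0;
}

void sort_with_fallback(int *a, int n)
{
    /* budget roughly linear in n, like bzip2's workFactor knob */
    if (insertion_sort_budget(a, n, 30L * n) < 0)
        qsort(a, n, sizeof(int), int_cmp);   /* fall back to the worst-case-safe sort */
}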

The slow algorithm basically does an initial bucket sort.  The
resulting buckets are much smaller and are processed with quicksort
(Sedgewick variant for strings).  When the quicksort branches are
small enough or don't shrink fast enough, it falls back to bubble
sort.

Calling it an initial bucket sort is not quite right, but close to
the truth.  Quicksort gets too slow when data sets grow large; bucket
sort has a lot of overhead per iteration, so it is slow when data
sets are small.  An initial bucket sort with a fallback to quicksort
is a natural trick. :)
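
As a toy example of that structure (plain C strings instead of bzip2's
rotation indices, only one byte of bucketing, and nothing taken from
Julian's sources):

#include <stdlib.h>
#include <string.h>

static int str_cmp(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Bucket on the first byte, then quicksort each (much smaller) bucket. */
void bucket_then_qsort(const char **s, int n)
{
    int count[256] = {0}, start[256], i, c;
    const char **tmp = malloc(n * sizeof(*tmp));

    for (i = 0; i < n; i++)                  /* histogram of first bytes */
        count[(unsigned char)s[i][0]]++;
    for (c = 0, i = 0; c < 256; c++) {       /* bucket start offsets */
        start[c] = i;
        i += count[c];
    }
    for (i = 0; i < n; i++)                  /* distribute into buckets */
        tmp[start[(unsigned char)s[i][0]]++] = s[i];
    memcpy(s, tmp, n * sizeof(*tmp));
    free(tmp);

    for (c = 0; c < 256; c++)                /* comparison sort per bucket */
        qsort(s + start[c] - count[c], count[c], sizeof(*s), str_cmp);
}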

The bubble sort is a bit nifty: it bubbles all elements up by 4
places first, then does another run and bubbles up 1 place at a time.
This makes things a little faster - you could even sort large
datasets in O(n*ln(n)) if you bubble by 64, 16, 4, 1, etc.
Bubblesort has a little less overhead than Julian's quicksort
implementation, but I have a slightly faster one, so that isn't
really necessary.
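
Written out as a gapped insertion sort, which is the same "bubble by h
places" idea (plain integers here, layout and names are mine):

/* One pass with a gap of 4, then one with a gap of 1; growing gaps[]
 * to { 64, 16, 4, 1 } turns this into a shellsort. */
void gapped_bubble_sort(int *a, int n)
{
    static const int gaps[] = { 4, 1 };
    int g, h, i, j, v;

    for (g = 0; g < 2; g++) {
        h = gaps[g];
        for (i = h; i < n; i++) {   /* bubble a[i] towards the front in steps of h */
            v = a[i];
            for (j = i; j >= h && a[j - h] > v; j -= h)
                a[j] = a[j - h];
            a[j] = v;
        }
    }
}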

But bubblesort is *much* faster when you have degenerate data to
sort, like "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbba".
Quicksort really fails here; bubblesort still does quite well.

Back to the initial "bucketsort".  This I haven't completely figured
out yet.  There are plenty of papers on the subject, but they are not
exactly easy reading and Julian deviated a little from the algorithms
in the papers.  The performance appears to be better than simple
bucket sort, so this is quite interesting.

Concatenating two long random strings together is another worst case
for sorting, but one where Julian does quite well.  He must have
solved that problem at the root, while I tried to do it at the
leaves.  I've gained an 8x performance boost, but he still wins.
Darn!

Source code for my implementation is not available, sorry, but I hope
this analysis helps you anyway. :)

> > > Any future uber-xeon that has a 4 megabyte L2 cache is going to do bunzip
> > > _really_ fast. :)
> >
> > Wouldn't running busybox on that machine be kinda pointless maybe? ;)
> 
> A simple, straightforward, understandable implementation of an algorithm is 
> generally preferable to a craptacular pile of cruft.  It took the FSF 833 
> lines of C to implement "cat".  Yes, really.  I am not in favor of this.

Agreed.  It regularly occurs to me that really big and really small
machines have the same problems.  But the better solutions come from
the embedded people, as they cannot just throw more hardware at the
problem. :)

> Right now I'm mentally nibbling at the edge of the block sorting machinery, 
> and trying to figure out where the speed/complexity tradeoffs really are.  

This is hard to figure out unless you think about the different
kinds of data.  Maybe this helps as well.

1. Random data is trivial to sort.  Any decent algorithm is O(n*ln(n))
2. "Normal" data is close enough to random data.  O(n*ln(n))
3. Degenerate data causes most algorithms to approach O(n^2).
3.1. "aaaaaaaaaaaaaaaaaaaa..." - all characters identical - simple to
     catch, not a real problem.
3.2. "aaaaaaaaaaaaaaaaaaaa...aaab" - a single character different.
     This is much harder to catch and optimize.  It is also a very
     real problem, as many binaries have large ranges of 0x00, so this
     needs to be fast.
3.3. "abcdefghijklmnopq...abcdefghijklmnopq..." - harmless string
     concatenated to itself.  Also causes O(n^2) behaviour, but needs
     a completely different solution.
3.4. "a" -> "aba" -> "abacaba" -> "abacabadabacaba" - nasty
     combinations of 3.2 and 3.3.  Special treatment for either one
     causes you to catch the other worst case here.  Hard.

There may be more, but those are the types I have found.  And they
really matter in the real world.  I've been hit hard by the Swen
worm, receiving some 100 mails per day, each with identical or
similar attachments.  Trying to compress these with bzip2 takes ages,
while gzip is lightning fast.  This is one of the reasons I want to
design a new compression algorithm that improves on bzip2. :)
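
For anyone who wants to time their own sort against these cases, the
interesting inputs are easy to generate (helper names are made up, this
is just test scaffolding):

#include <stdlib.h>
#include <string.h>

/* 3.2: "aaa...aab" - a long run with a single different character */
char *gen_almost_uniform(int n)
{
    char *s = malloc(n + 1);

    memset(s, 'a', n);
    s[n - 1] = 'b';
    s[n] = '\0';
    return s;
}

/* 3.4: "a" -> "aba" -> "abacaba" -> ... (length 2^k - 1); character
 * number i sits at every 2^i-th position, starting at 2^(i-1) */
char *gen_abacaba(int k)
{
    int len = (1 << k) - 1, i, pos, step;
    char *s = malloc(len + 1);

    for (i = 1; i <= k; i++) {
        step = 1 << i;
        for (pos = (step >> 1) - 1; pos < len; pos += step)
            s[pos] = 'a' + i - 1;
    }
    s[len] = '\0';
    return s;
}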

> > My killer application for this was and still is jffs2.  But tests
> > showed that bzip2 actually compresses worse than gzip when running on
> > chunks of <=4k.  I've pretty much figured out the reason by now and am
> > just writing up a paper with my findings.  Hope that IBM allows me to
> > follow up on that and create a new compression suite that is actually
> > better than bzip2.  We'll see.
> 
> There's a number of little places bzip2 could be tightened up (use 64 symbols 
> instead of 50 for the chunk size, what IS the deal with that discarded symbol 
> at the beginning of each block and the fact the first three symbols don't 
> qualify as a run...) but backwards compatibility is probably going to doom 
> any such effort.

Right, but I'm off to burn the boats and start from scratch anyway.
When (and if) I'm finished, the result should be incompatible with
bzip2, never compress worse and never run longer.  Ok, 1% worse here
and there may be acceptable, but nothing noticeable.

> > - How can we reduce the memory consumption of say 1000 open ipsec
> >   connections, eating up a zlib workspace each?
> 
> My lack of caring about ipsec is fairly deep.  It's at the wrong layer.  
> Encryption is inherently stateful.  The packet layer isn't supposed to be 
> stateful.  The TCP/IP SESSION layer is supposed to be stateful...
> 
> You're dealing with a fundamentally misdesigned protocol, I'm afraid.  Fixing 
> it probably isn't actually possible, the best you can do is mitigate the 
> damage.

While I agree with you, the problem would just shift somewhere else
with a better design.  A server with thousands of compressed
connections takes a big memory hit either way.  With the plain zlib
implementation the impact is smaller *on average* than inside the
kernel with explicit workspaces, but the worst-case behaviour is
identical.

This problem needs to be solved.  We can shift it somewhere else and
postpone the solution for some time, but not forever.

> > - How much code can be shared among different compression algorithms?
> 
> The CRC algorithms are similar.  There's only a few ways the basic CRC usually 
> gets mangled: big endian or little endian, initialize to 0 or ffffffff, and ~ 
> the result at the end or not...
> 
> If I remember, gzip is big endian and bzip is little endian.  Initializing the
> table is about four lines of code, and trying to share the data at run time 
> sounds like an amazing pain that's BOUND to hurt throughput...

Not sure about that, but CRC is already singled out in current 2.6
kernels, so that part should be simple.
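
For illustration, a CRC-32 parameterised along the axes Rob lists might
look like this - bitwise rather than table driven to keep it short; a
real implementation would generate a 256-entry table per variant:

#include <stdint.h>
#include <stddef.h>

uint32_t crc32_generic(const unsigned char *buf, size_t len,
                       uint32_t init, int reflected, int invert_out)
{
    uint32_t crc = init;
    size_t i;
    int b;

    for (i = 0; i < len; i++) {
        if (reflected) {                    /* LSB first, poly 0xEDB88320 */
            crc ^= buf[i];
            for (b = 0; b < 8; b++)
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
        } else {                            /* MSB first, poly 0x04C11DB7 */
            crc ^= (uint32_t)buf[i] << 24;
            for (b = 0; b < 8; b++)
                crc = (crc << 1) ^ (0x04C11DB7u & -(crc >> 31));
        }
    }
    return invert_out ? ~crc : crc;
}

If I remember right, both gzip and bzip2 then use an initial value of
0xffffffff and a final inversion and differ only in bit order, but
don't hold me to that.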

To me, the workspace problem is more interesting.  Kernel zlib
doesn't malloc anymore; bzlib should receive the same treatment.  And
after that, how do we minimize the memory impact?  7MiB per bzlib
stream is nothing for the faint of heart.
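
Roughly, I'd like the interface to follow the explicit-workspace
pattern that kernel zlib already uses.  A sketch with illustrative
names (not the real bzlib API, and the sizes are placeholders):

#include <stddef.h>
#include <stdint.h>

struct bz_state {
    uint32_t *ptr;      /* rotation/suffix array: 4 bytes per input byte */
    uint8_t  *block;    /* the block itself: 1 byte per input byte       */
};

/* Report the worst-case memory need for one stream, so the caller can
 * allocate a single chunk up front (and never again). */
size_t bz_workspace_size(size_t block_size)
{
    return block_size * (sizeof(uint32_t) + 1);
}

/* Carve the stream state out of caller-provided memory; the library
 * itself never calls malloc. */
void bz_init_with_workspace(struct bz_state *s, void *workspace,
                            size_t block_size)
{
    s->ptr   = workspace;
    s->block = (uint8_t *)workspace + block_size * sizeof(uint32_t);
}

With something like this, the workspace can be reused for every block,
or shared between streams under a lock, which is where the per-stream
memory could be clawed back.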

> > - There are users that will compress block-based, others need
> >   streaming support.  Can these two classes share code?  How?
> 
> Bzip is designed to work on large quantities of data.  It's not a packet level
> protocol, and isn't likely to become one.

Right, I agree and redesign. :)

Jörn

-- 
The grand essentials of happiness are: something to do, something to
love, and something to hope for.
-- Allan K. Chalmers


