[BusyBox] Re: Where's the bzip2 compressed linux-kernel patch?

Rob Landley rob at landley.net
Wed Oct 22 11:13:50 UTC 2003


On Tuesday 21 October 2003 08:15, Jörn Engel wrote:


> Concatenating two long random strings together is another worst case
> for sorting, but one where Julian does quite well.  He must have
> solved that problem at the root, while I tried to do it at the leaves.
> I've gained an 8x performance boost, but he still wins.  Darn!
>
> Source code for my implementation is not available, sorry, but I hope
> this analysis helps you anyway. :)

The sort should more or less be pluggable, so what I may do is just abstract 
it out into a sort() method, feed that to the library implementation of 
quicksort for now, and come back after I've got the rest of it converted to C 
and see what speedups are worth the increase in executable size.
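Something like this is the shape I have in mind (a rough sketch only, leaning 
on the libc qsort() for now; the names are placeholders, not actual busybox 
code):

#include <stdlib.h>

/* Placeholder sketch: hide the block sort behind one entry point so the
 * algorithm can be swapped later without touching the callers.  Standard
 * qsort() takes no context pointer, so the block being sorted sits in
 * file-scope statics for now. */
static const unsigned char *sort_block;
static int sort_len;

/* Compare two rotations of the block, identified by starting offset. */
static int rotation_cmp(const void *a, const void *b)
{
    int i = *(const int *)a, j = *(const int *)b;
    int k;

    for (k = 0; k < sort_len; k++) {
        unsigned char ca = sort_block[(i + k) % sort_len];
        unsigned char cb = sort_block[(j + k) % sort_len];
        if (ca != cb)
            return (int)ca - (int)cb;
    }
    return 0;
}

/* The pluggable sort(): today it's just library quicksort, tomorrow it
 * could be something smarter with the same signature. */
static void block_sort(int *idx, int count, const unsigned char *block)
{
    sort_block = block;
    sort_len = count;
    qsort(idx, count, sizeof(*idx), rotation_cmp);
}

The obvious catch is that the naive comparison loop costs O(n) per call on 
repetitive data, which is exactly where the degenerate cases further down 
bite.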

> > > > Any future uber-xeon that has a 4 megabyte L2 cache is going to do
> > > > bunzip _really_ fast. :)
> > >
> > > Wouldn't running busybox on that machine be kinda pointless maybe? ;)
> >
> > A simple, straightforward, understandable implementation of an algorithm
> > is generally preferable to a craptacular pile of cruft.  It took the FSF
> > 833 lines of C to implement "cat".  Yes, really.  I am not in favor of
> > this.
>
> Agreed.  It regularly occurs to me that really big and really small
> machines have the same problems.  But the better solution comes from
> the embedded people, as they cannot just throw more hardware at the
> problem. :)

I still think laptops are evolving to replace the big monster servers.  Hot 
pluggable hardware?  Sure: pcmcia and USB.  Uninterruptible power supply?  
Try hot-swappable batteries with hours of usage time and battery monitoring 
well-integrated into the rest of the system.  Journaling filesystems?  But of 
course.  And of course the laptop people noticed that power consumption (and 
the resulting heat) was important before the clustering people figured out 
what the limiting factor on cluster density really was.  (Your 
air-conditioning bill.)

And these days, what is a "blade" except rackmounted laptop components 
replacing the 1U?

> > Right now I'm mentally nibbling at the edge of the block sorting
> > machinery, and trying to figure out where the speed/complexity tradeoffs
> > really are.
>
> This is hard to figure out, unless you can think of the different
> kinds of data.  Maybe this helps as well.
>
> 1. Random data is trivial to sort.  Any decent algorithm is O(n*ln(n))
> 2. "Normal" data is close enough to random data.  O(n*ln(n))
> 3. Degenerate data causes most algorithms to approach O(n^2).

I did have an algorithms class in college. :)

Personally, I was always rather fond of heap sort, but that's just me...

> 3.1. "aaaaaaaaaaaaaaaaaaaa..." - all characters identical - simple to
>      catch, not a real problem.
> 3.2. "aaaaaaaaaaaaaaaaaaaa...aaab" - a single character different.
>      This is much harder to catch and optimize.  It is also a very
>      real problem, as many binaries have large ranges of 0x00, so this
>      needs to be fast.
> 3.3. "abcdefghijklmnopq...abcdefghijklmnopq..." - harmless string
>      concatenated to itself.  Also causes O(n^2) behaviour, but needs
>      a completely different solution.
> 3.4. "a" -> "aba" -> "abacaba" -> "abacabadabacaba" - nasty
>      combinations of 3.2 and 3.3.  Special treatment for either one
>      causes you to catch the other worst case here.  Hard.

Okay, lemme clarify:

Sorting is its own discipline, which I have successfully managed to avoid 
being involved in for rather a long time now.  I actually implemented a 
couple of bubble sorts in things just because it was three lines of code and 
I knew the data set would never have more than about 35 entries in it.  (Go 
ahead and scream.  I don't mind.)

Looking at the block sorting machinery, I've largely been trying to see if I 
could throw any of it away without losing too much speed, or at least 
refactor it to make the fallback order a bit more readily apparent.  I don't 
expect to be able to fundamentally improve on the sorting theory end of 
things without spending several months in the library, which I haven't got 
time for just now.  However, the decompress code was so complicated and 
crufty that it hid a good dozen major thinkos, which allowed some serious 
simplification after it was all untangled.  There might be similar 
opportunities on the compression side without actually changing the theory on 
which it operates.
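Just to put a number on how hard case 3.2 bites a naive comparison, here's a 
throwaway test program (purely illustrative, nothing from the real code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Throwaway demo of Jörn's case 3.2: on "aaa...aab" a naive rotation
 * comparison has to scan nearly the whole buffer before finding a
 * difference, so each of the O(n log n) comparisons costs O(n) instead of
 * O(1).  Not bzip2 code, just an illustration. */

static long bytes_compared;

static int rotation_cmp(const unsigned char *buf, int len, int i, int j)
{
    int k;
    for (k = 0; k < len; k++) {
        unsigned char a = buf[(i + k) % len], b = buf[(j + k) % len];
        bytes_compared++;
        if (a != b)
            return a - b;
    }
    return 0;
}

int main(void)
{
    int n = 100000;
    unsigned char *buf = malloc(n);

    memset(buf, 'a', n);
    buf[n - 1] = 'b';             /* case 3.2: one odd byte at the end */

    rotation_cmp(buf, n, 0, 1);   /* compare two adjacent rotations */
    printf("one comparison touched %ld bytes\n", bytes_compared);

    free(buf);
    return 0;
}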

One question I'm still rooting around for an answer to is whether the first 
batch of run length compression (the one undone after the Burrows-Wheeler 
detransform during decompress) is done before the sorting in order to 
mitigate 3.1 and 3.2 from your list.  I think so, but I haven't found it in 
the compress code yet.  (I will; I was just busy on the decompression side 
this evening, and have had about my fill of bzip for one day already.  I'll 
pick it up again tomorrow.)
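If it is there, I'd expect the pre-pass to look roughly like this (a sketch 
of the bzip2-style initial run-length step as I understand it from working 
through the decompress side; the function itself is mine, not lifted from 
bzip2):

#include <string.h>

/* Sketch of a bzip2-style initial RLE pass (my reading of the format, not
 * the reference code): a run of 4..255 identical bytes becomes the first
 * four bytes plus one count byte holding the number of extra repeats
 * (0..251).  Shorter runs are copied through untouched.
 * Returns bytes written, or -1 if the output buffer is too small. */
static int rle1_encode(const unsigned char *in, int inlen,
                       unsigned char *out, int outmax)
{
    int i = 0, o = 0;

    while (i < inlen) {
        unsigned char c = in[i];
        int run = 1;

        while (i + run < inlen && in[i + run] == c && run < 255)
            run++;

        if (run < 4) {
            if (o + run > outmax)
                return -1;
            memset(out + o, c, run);
            o += run;
        } else {
            if (o + 5 > outmax)
                return -1;
            memset(out + o, c, 4);
            out[o + 4] = (unsigned char)(run - 4);   /* extra repeats */
            o += 5;
        }
        i += run;
    }
    return o;
}

That caps any run at five output bytes per 255 input bytes, which would 
defang 3.1 and 3.2 before the sort ever sees them.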

> There may be more, but those are the types I have found.  And they
> really matter in the real world.  I've been hit hard by the swen worm,
> receiving some 100 mails per day, each having identical or similar
> attachments.  Trying to compress these with bzip2 takes ages, while
> gzip is lightning fast.  This is one of the reasons for me to design a
> new compression algorithm that improves on bzip2. :)

I'm all for it, but if code isn't available I'm not going to pay any attention 
to it.

> > There's a number of little places bzip2 could be tightened up (use 64
> > symbols instead of 50 for the chunk size, what IS the deal with that
> > discarded symbol at the beginning of each block and the fact the first
> > three symbols don't qualify as a run...) but backwards compatibility is
> > probably going to doom any such effort.
>
> Right, but I'm off to burn the boats and start from scratch anyway.
> When (and if) I'm finished, the result should be incompatible with
> bzip2, never compress worse and never run longer.  Ok, 1% worse here
> and there may be acceptable, but nothing noticeable.

Cool.

I must admit to being curious whether that closed-source UPX thing people were 
talking about is removing unnecessary ELF segments from the executable it's 
compressing.  That's one really easy way to make yourself look good, strip 
debugging info... :)

> > > - How can we reduce the memory consumption of say 1000 open ipsec
> > >   connections, eating up a zlib workspace each?
> >
> > My lack of caring about ipsec is fairly deep.  It's at the wrong layer.
> > Encryption is inherently stateful.  The packet layer isn't supposed to be
> > stateful.  The TCP/IP SESSION layer is supposed to be stateful...
> >
> > You're dealing with a fundamentally misdesigned protocol, I'm afraid. 
> > Fixing it probably isn't actually possible; the best you can do is
> > mitigate the damage.
>
> While I agree with you, the problem would just shift somewhere else
> with a better design.  Having a server with thousands of compressed
> connections results in a big memory impact.  With the plain zlib
> implementation this is less than inside the kernel with explicit
> workspaces *on average*, but the worst case behaviour is identical.

I made a VPN once combining port forwarding and transparent proxying: it 
intercepted connections, forwarded them to something that looked up which 
gateway handled that subnet and did a passwordless ssh to the remote machine 
(with stdin and stdout redirected to the ssh session), and at the remote end 
netcat completed the connection.  Worked beautifully, and scaled amazingly.  
(Didn't handle UDP, but oh well.)

Really easy to do, too.  Total implementation was a couple hundred lines of C.  
Read special comments out of /etc/hosts to know what gateway handled what IP 
address range.

I keep meaning to do an open source implementation of that.  It's on my to-do 
list... :)

> This problem needs to be solved.  We can shift it somewhere else and
> postpone the solution for some time, but not forever.
>
> > > - How much code can be shared among different compression algorithms?
> >
> > The CRC algorithms are similar.  There's only a few ways the basic CRC
> > usually gets mangled: big endian or little endian, initialize to 0 or
> > ffffffff, and ~ the result at the end or not...
> >
> > If I remember, gzip is big endian and bzip is little endian. 
> > Initializing the table is about four lines of code, and trying to share
> > the data at run time sounds like an amazing pain that's BOUND to hurt
> > throughput...
>
> Not sure about that, but crc is already singled out in current 2.6
> kernels, so that part should be simple.

It's also five lines to initialize the table and one to use it (which is 
always inlined for speed), so it's not a major win to collect it together 
anyway...
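For reference, the whole thing is about this much code (a generic sketch, 
MSB-first flavor with the usual 0x04c11db7 polynomial; the reflected flavor 
just mirrors the shifts and table indexing, and the init/final-invert choices 
layer on top of either one):

#include <stdint.h>

/* Table-driven CRC-32 sketch, MSB-first ("big endian") bit order with the
 * standard 0x04c11db7 polynomial.  The other flavors differ only in bit
 * order, the initial value (0 vs 0xffffffff), and whether the result gets
 * inverted at the end. */
static uint32_t crc_table[256];

static void crc_init(void)
{
    int i, j;

    for (i = 0; i < 256; i++) {
        uint32_t c = (uint32_t)i << 24;
        for (j = 0; j < 8; j++)
            c = (c & 0x80000000) ? (c << 1) ^ 0x04c11db7 : (c << 1);
        crc_table[i] = c;
    }
}

/* The one line that does the actual work, per byte: */
static uint32_t crc_update(uint32_t crc, unsigned char byte)
{
    return (crc << 8) ^ crc_table[((crc >> 24) ^ byte) & 0xff];
}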

> To me, the workspace problem is more interesting.  Kernel zlib doesn't
> malloc anymore; bzlib should receive the same treatment.  And after
> that, how do we minimize the memory impact?  7MiB per bzlib stream is
> not for the faint of heart.

Well, my implementation has it down to 4 megabytes per stream.  (3.6 megs for 
dbuf, plus a couple hundred k for sundries in bzip_data.)  I vaguely remember 
trimming an amazing amount of crap out of the data structures, including some 
humongous allocations that were just bonkers.

That said, you're not getting it below 1 megabyte per stream no matter what 
you do, because the intermediate buffer has to store 900k before the 
Burrows-Wheeler transform can be undone, and you CAN'T decompress until you 
have a complete block.
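The arithmetic, roughly (my implementation's numbers, so take the exact 
constants with a grain of salt):

/* Rough per-stream footprint for a 900k block (approximate, from my own
 * implementation): dbuf keeps one 32-bit word per block position, holding
 * (as I recall) the output byte in the low 8 bits and the back-pointer
 * needed to undo the Burrows-Wheeler transform in the rest. */
#define BLOCK_MAX   900000
#define DBUF_BYTES  (BLOCK_MAX * 4)   /* 3,600,000 bytes, i.e. ~3.6 megs */
/* Plus a couple hundred k of Huffman tables and symbol maps in bzip_data,
 * which is where the "4 megabytes per stream" figure comes from.  The 900k
 * block itself is the floor: nothing comes out until a whole block is in. */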

This is not designed to be a streaming protocol.  (The latency on output is 
similarly horked: if you try flushing the output after 4k of data, your 
compression ratio is going to be hamstrung.)  You're talking about an 
application for this that does not interest me.

> > > - There are users that will compress block-based, others need
> > >   streaming support.  Can these two classes share code?  How?
> >
> > Bzip is designed to work on large quantities of data.  It's not a packet
> > level protocol, and isn't likely to become one.
>
> Right, I agree and redesign. :)

Feel free.  I'm not interested in that.  Packet level encryption is a bad 
idea.  Public key cryptography is too expensive to renegotiate for each 
packet, so you MUST save state, but the packet layer is designed to be 
stateless; you've shot yourself in the foot before you even begin, and I'm 
just not going there.  I encrypt at the session layer, and I can do it 
transparently with transparent proxying for tcp/ip sessions, and that's good 
enough for me.

You're talking about a problem I have no interest in.  Feel free to adapt 
Manuel's and my bunzip cleanup if you think it'll help (it's LGPL), but I'm just 
not going there.  Bzip is the best way to distribute tarballs at the moment, 
possibly good for compressing kernel images, etc.  It's not a replacement for 
mp3 or mp4, either. :)

> Jörn

Rob


