[BusyBox] Re: [PATCH] Re: Sigh... this time i attached the file (bunzip-4.c)

Sat Oct 25 02:48:30 UTC 2003

On Friday 24 October 2003 13:31, Manuel Novoa III wrote:
> Rob,
>
> On Fri, Oct 24, 2003 at 08:09:00AM -0500, Rob Landley wrote:
> > Actually, the n*log(n) shell sort you suggested earlier works well enough
> > for my purposes.  Sorting one 900k block of data (which will be after run
> > length compression, so it's usually rather a lot more data than that)
> > takes a little over 13 seconds of user time (worst case, all zeroes) and
> > 8.15 seconds tested with random data.  (This is on a 900 mhz celeron
> > laptop; not unreasonably far off from the times bzip itself is getting,
> > although I haven't torn apart and instrmented bzip's block sorting stuff
> > yet...)
> >
> > The shell sort uses no more data than the 900k of charcters and the index
> > array.  I also whipped up a couple of GET_IDX and SET_IDX functions that
> > only use 20 bits for each index array entry instead of 32 (bringing the
> > total block sorting memory usage down to 3125000 bytes), but all the
> > extra anding, oring, and bit-shifting bloated the worst-case sort time to
> > 22 seconds for a 900k block of zeroes...
>
> Am I missing something here?  You want to (stably) sort a 900k block of
> bytes.  Why would you use an O(n*log(n)) general purpose sort when you
> can do that in O(n)?
>
> Manuel

It's not exactly sorting the bytes.   A burrows wheeler transform involves 
sorting a bunch of variants of the same string.

Turn the array into a ring buffer.  Start at each point of the ring buffer and 
continue for sizeof(buf) wrapping around until you reach the byte right 
before you started at.  Do that for(start=0;start<len;start++) so you get a 
string starting at each possible position.  Now sort the strings.

Of course nobody in their right mind makes 900k copies of a 900k string, but 
the result isn't actually a sorted buffer (you could do that with bucket sort 
with a for loop), easy.  The result is an array of starting points, and since 
the highest starting point is 899999, your starting point array has to have 
at least 20 bits per value.

The worst case is when every character in the string is the same, so that the 
strncmps you're doing on the segments of the buffer have to keep going 
through all 900k, twice, for each comparison, just to prove they're equal and 
don't need to be moved.  For normal data (or random data), the strcmp can 
stop after a few bytes (or, for the last few comparisons when you get close 
to having the puppy sorted, maybe a few kilobytes), and figure out which way 
the inequality falls.  Thus it's much faster.

Or words to that effect.

Rob