[BusyBox] Re: [PATCH] Re: Sigh... this time i attached the file (bunzip-4.c)
rob at landley.net
Sat Oct 25 02:48:30 UTC 2003
On Friday 24 October 2003 13:31, Manuel Novoa III wrote:
> On Fri, Oct 24, 2003 at 08:09:00AM -0500, Rob Landley wrote:
> > Actually, the n*log(n) shell sort you suggested earlier works well enough
> > for my purposes. Sorting one 900k block of data (which will be after run
> > length compression, so it's usually rather a lot more data than that)
> > takes a little over 13 seconds of user time (worst case, all zeroes) and
> > 8.15 seconds tested with random data. (This is on a 900 mhz celeron
> > laptop; not unreasonably far off from the times bzip itself is getting,
> > although I haven't torn apart and instrmented bzip's block sorting stuff
> > yet...)
> > The shell sort uses no more data than the 900k of charcters and the index
> > array. I also whipped up a couple of GET_IDX and SET_IDX functions that
> > only use 20 bits for each index array entry instead of 32 (bringing the
> > total block sorting memory usage down to 3125000 bytes), but all the
> > extra anding, oring, and bit-shifting bloated the worst-case sort time to
> > 22 seconds for a 900k block of zeroes...
> Am I missing something here? You want to (stably) sort a 900k block of
> bytes. Why would you use an O(n*log(n)) general purpose sort when you
> can do that in O(n)?
It's not exactly sorting the bytes. A burrows wheeler transform involves
sorting a bunch of variants of the same string.
Turn the array into a ring buffer. Start at each point of the ring buffer and
continue for sizeof(buf) wrapping around until you reach the byte right
before you started at. Do that for(start=0;start<len;start++) so you get a
string starting at each possible position. Now sort the strings.
Of course nobody in their right mind makes 900k copies of a 900k string, but
the result isn't actually a sorted buffer (you could do that with bucket sort
with a for loop), easy. The result is an array of starting points, and since
the highest starting point is 899999, your starting point array has to have
at least 20 bits per value.
The worst case is when every character in the string is the same, so that the
strncmps you're doing on the segments of the buffer have to keep going
through all 900k, twice, for each comparison, just to prove they're equal and
don't need to be moved. For normal data (or random data), the strcmp can
stop after a few bytes (or, for the last few comparisons when you get close
to having the puppy sorted, maybe a few kilobytes), and figure out which way
the inequality falls. Thus it's much faster.
Or words to that effect.
More information about the busybox