[BusyBox] Re: [PATCH] Re: Sigh... this time i attached the file (bunzip-4.c)
rob at landley.net
Fri Oct 24 13:09:00 UTC 2003
On Friday 24 October 2003 04:45, Jörn Engel wrote:
> On Fri, 24 October 2003 00:28:28 -0500, Rob Landley wrote:
> > On Thursday 23 October 2003 11:01, Jörn Engel wrote:
> > > > When reading from a filehandle, we can easily figure out how much we
> > > > overshot and lseek backwards a bit. The only question is should this
> > > > be in the gzip/bzip library code, or in the code that calls it?
> > >
> > > It should definitely be in the library, completely transparent to the
> > > application, imo.
> > >
> > > Not exactly sure what bzip2 uses the overshoot for,
> > We read 4k at a time from input into a buffer, so get_bits can return an
> > arbitrary number of bits of input when asked. If we needed to do a
> > syscall to get the next byte, it would slow the implementation down
> > tremendously.
> > The problem with it being in the library is that lseek has no effect when
> > doing "cat file.tbz | progname". You can't seek backwards on stdin...
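The buffered get_bits scheme described above might look something like this. This is only a sketch, not BusyBox's actual code; the names (bit_reader, get_bits, IOBUF_SIZE) and the MSB-first bit order are illustrative assumptions:

```c
#include <stdint.h>

#define IOBUF_SIZE 4096  /* refill granularity assumed from the 4k figure above */

struct bit_reader {
    const unsigned char *data;  /* backing byte stream (stands in for read()) */
    unsigned len;               /* bytes available in data */
    unsigned pos;               /* next byte to consume */
    uint32_t bitbuf;            /* bits accumulated but not yet handed out */
    int bitcount;               /* number of valid bits in bitbuf */
};

/* Return the next n bits (n <= 24), MSB first, or -1 on end of input.
 * Most calls are satisfied from bitbuf without touching the byte source. */
static long get_bits(struct bit_reader *br, int n)
{
    while (br->bitcount < n) {
        if (br->pos >= br->len)
            return -1;                  /* ran out of input */
        br->bitbuf = (br->bitbuf << 8) | br->data[br->pos++];
        br->bitcount += 8;
    }
    br->bitcount -= n;
    return (br->bitbuf >> br->bitcount) & ((1UL << n) - 1);
}
```

Because refills happen a byte (and in real code, a 4k buffer) at a time, the reader can end up holding input past the end of the compressed stream, which is exactly the overshoot being discussed.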
> Sounds like my guess was wrong, you didn't refer to the same
> 'overshoot', I thought you did. :)
> Not sure then. It might be an idea to change interface slightly,
> letting the user provide a read function.
It will still read more data than it needs from their supplied function, so
how does this help? (Sure the user can snapshot the data somehow as it goes
through their input function, but unless they go back and query the
compression library for how MUCH extra data it read, this will do them no
good. If you're querying the library, the library can be made to cough up
not just the length but also the extra data it read, so why reinvent the
wheel?) They can either reinvent stdio, or they can ask for the extra data
back at the end, but either way it's going to be a bit funky...
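The "ask for the extra data back" option could be as simple as the library exposing its buffer position after decompression finishes. A hypothetical sketch (the struct and function names are invented for illustration, not any real bunzip API):

```c
/* Hypothetical decompressor state: the library buffers input internally,
 * so when the stream ends it may have read bytes it never consumed. */
struct decomp_state {
    unsigned char inbuf[4096];
    unsigned inbuf_len;   /* bytes currently in inbuf */
    unsigned inbuf_pos;   /* bytes actually consumed by the decompressor */
};

/* How many bytes past end-of-stream did the library overshoot? */
static unsigned overshoot_len(const struct decomp_state *st)
{
    return st->inbuf_len - st->inbuf_pos;
}

/* Pointer to the unconsumed tail, valid until the next buffer refill. */
static const unsigned char *overshoot_data(const struct decomp_state *st)
{
    return st->inbuf + st->inbuf_pos;
}
```

A pipe-fed caller could then recover the trailing bytes without any lseek, while a file-fed caller could lseek backwards by overshoot_len() instead.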
> > > My (ugly) solution was to use char[2*900000], so my memory footprint
> > > is 10n, instead of 9n. I guess that the overshoot is just a little
> > > smarter, leaving some corner cases in but saving most of the memory I
> > > wasted.
> > It sounds like your footprint was 18n, actually...
> Nope, I merely copied the plain text, not the suffix arrays.
Actually, the n*log(n) shell sort you suggested earlier works well enough for
my purposes. Sorting one 900k block of data (which will be after run length
compression, so it's usually rather a lot more data than that) takes a little
over 13 seconds of user time (worst case, all zeroes) and 8.15 seconds tested
with random data. (This is on a 900 MHz Celeron laptop; not unreasonably far
off from the times bzip itself is getting, although I haven't torn apart and
instrumented bzip's block sorting stuff yet...)
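For reference, block sorting a bzip-style 900k block by shell sort amounts to sorting an array of rotation start offsets. The following is a minimal sketch under stated assumptions: compare_rot(), the globals, and the plain n/2 gap sequence are illustrative choices, not the code being timed above:

```c
/* Block to sort and its length; globals keep the comparator signature simple. */
static const unsigned char *g_block;
static unsigned g_len;

/* Compare two rotations of the block lexicographically. Worst case (all one
 * byte value) scans the whole block per comparison, which matches why the
 * all-zeroes input above is the slow case. */
static int compare_rot(unsigned a, unsigned b)
{
    unsigned i;
    for (i = 0; i < g_len; i++) {
        unsigned char ca = g_block[(a + i) % g_len];
        unsigned char cb = g_block[(b + i) % g_len];
        if (ca != cb) return ca < cb ? -1 : 1;
    }
    return 0;
}

/* Shell sort the index array: gapped insertion sort with shrinking gaps.
 * Needs no memory beyond the block and the index array itself. */
static void shell_sort_indices(unsigned *idx, unsigned n)
{
    unsigned gap, i, j;
    for (gap = n / 2; gap > 0; gap /= 2) {
        for (i = gap; i < n; i++) {
            unsigned tmp = idx[i];
            for (j = i; j >= gap && compare_rot(tmp, idx[j - gap]) < 0; j -= gap)
                idx[j] = idx[j - gap];
            idx[j] = tmp;
        }
    }
}
```

Sorting "banana" this way yields the rotation order {5, 3, 1, 0, 4, 2}, i.e. the usual Burrows-Wheeler sorted-rotations matrix.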
The shell sort uses no more data than the 900k of characters and the index
array. I also whipped up a couple of GET_IDX and SET_IDX functions that only
use 20 bits for each index array entry instead of 32 (bringing the total
block sorting memory usage down to 3125000 bytes), but all the extra anding,
oring, and bit-shifting bloated the worst-case sort time to 22 seconds for a
900k block of zeroes...
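The 20-bit packing trade-off looks roughly like this. A hedged sketch along the lines of GET_IDX/SET_IDX: two 20-bit entries share five bytes (2.5 bytes per entry instead of 4), and every access pays for it in shifts and masks. The little-endian bit layout here is an illustrative assumption, not necessarily the one tried above:

```c
#include <stdint.h>

/* Read 20-bit entry i from a packed byte array. Entry i starts at bit
 * i*20, so its byte-relative shift is always 0 or 4, and any entry fits
 * in the 24 bits spanned by three consecutive bytes. */
static uint32_t get_idx(const unsigned char *buf, unsigned i)
{
    unsigned bitpos = i * 20;
    unsigned byte = bitpos >> 3;
    unsigned shift = bitpos & 7;        /* 0 or 4 */
    uint32_t v = buf[byte] | (buf[byte + 1] << 8)
               | ((uint32_t)buf[byte + 2] << 16);
    return (v >> shift) & 0xFFFFF;
}

/* Write 20-bit entry i, preserving the neighboring entry's bits that
 * share the same bytes: read three bytes, mask, merge, write back. */
static void set_idx(unsigned char *buf, unsigned i, uint32_t val)
{
    unsigned bitpos = i * 20;
    unsigned byte = bitpos >> 3;
    unsigned shift = bitpos & 7;
    uint32_t mask = (uint32_t)0xFFFFF << shift;
    uint32_t v = buf[byte] | (buf[byte + 1] << 8)
               | ((uint32_t)buf[byte + 2] << 16);
    v = (v & ~mask) | ((val & 0xFFFFF) << shift);
    buf[byte]     = v & 0xFF;
    buf[byte + 1] = (v >> 8) & 0xFF;
    buf[byte + 2] = (v >> 16) & 0xFF;
}
```

The read-modify-write in set_idx is where the worst-case sort time goes: every index store becomes three loads, a mask merge, and three stores instead of one aligned 32-bit write.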
My first pass at bzip compression may not work as fast as jseward's, but it'll
have a heck of a lot less code, and I can bolt the special-case-optimized
sorting functions onto it later, after I get it to work...