[BusyBox] Re: [PATCH] Re: Sigh... this time i attached the file (bunzip-4.c)

Rob Landley rob at landley.net
Fri Oct 24 13:09:00 UTC 2003


On Friday 24 October 2003 04:45, Jörn Engel wrote:
> On Fri, 24 October 2003 00:28:28 -0500, Rob Landley wrote:
> > On Thursday 23 October 2003 11:01, Jörn Engel wrote:
> > > > When reading from a filehandle, we can easily figure out how much we
> > > > overshot and lseek backwards a bit.  The only question is should this
> > > > be in the gzip/bzip library code, or in the code that calls it?
> > >
> > > It should definitely be in the library, completely transparent to the
> > > application, imo.
> > >
> > > Not exactly sure what bzip2 uses the overshoot for,
> >
> > We read 4k at a time from input into a buffer, so get_bits can return an
> > arbitrary number of bits of input when asked.  If we needed to do a
> > syscall to get the nexty byte, it would slow the implementation down
> > tremendously.
> >
> > The problem with it being in the library is that lseek has no effect when
> > doing "cat file.tbz | progname".  You can't seek backwards on stdin...
>
> Sounds like my guess was wrong, you didn't refer to the same
> 'overshoot', I thought you did. :)
>
> Not sure then.  It might be an idea to change interface slightly,
> letting the user provide a read function.

It will still read more data than it needs from their supplied function, so 
how does this help?  (Sure the user can snapshot the data somehow as it goes 
through their input function, but unless they go back and query the 
compression library for how MUCH extra data it read, this will do them no 
good.  If you're querying the library, the library can be made to cough up 
not just the length but also the extra data it read, so why reinvent the 
wheel?)  They can either reinvent stdio, or they can ask for the extra data 
back at the end, but either way it's going to be a bit funky...

> > > My (ugly) solution was to use char[2*900000], so my memory footprint
> > > is 10n, instead of 9n.  I guess that the overshoot is just a little
> > > smarter, leaving some corner cases in but saving most of the memory I
> > > wasted.
> >
> > It sounds like your footprint was 18n, actually...
>
> Nope, I merely copied the plain text, not the suffix arrays.

Actually, the n*log(n) shell sort you suggested earlier works well enough for 
my purposes.  Sorting one 900k block of data (which will be after run length 
compression, so it's usually rather a lot more data than that) takes a little 
over 13 seconds of user time (worst case, all zeroes) and 8.15 seconds tested 
with random data.  (This is on a 900 mhz celeron laptop; not unreasonably far 
off from the times bzip itself is getting, although I haven't torn apart and 
instrmented bzip's block sorting stuff yet...)

The shell sort uses no more data than the 900k of charcters and the index 
array.  I also whipped up a couple of GET_IDX and SET_IDX functions that only 
use 20 bits for each index array entry instead of 32 (bringing the total 
block sorting memory usage down to 3125000 bytes), but all the extra anding, 
oring, and bit-shifting bloated the worst-case sort time to 22 seconds for a 
900k block of zeroes...

My first pass at bzip compression may not work as fast as jseward's, but it'll 
have a heck of a lot less code, and I can bolt the special-case-optimized 
sorting functions onto it later, after I get it to work...

> Jörn

Rob



More information about the busybox mailing list