bunzip2 fails to decompress pbzip2-compressed files

Sat Nov 6 23:40:22 UTC 2010

On Friday 05 November 2010 20:27:47 Denys Vlasenko wrote:
> On Wed, Nov 3, 2010 at 7:09 PM, Rob Landley <rob at landley.net> wrote:
> >> > I thought it was inherent in the mandate of the project, but
> >> > apparently not. The focus these days is on features, adding more and
> >> > more, always making the project bigger and more complicated.
> >> >
> >> > I look around and everywhere see things that aren't that hard to clean
> >> > up,
> >>
> >> Which ones (except those mentioned in TODO)?
> >
> > It's sort of a constant background thing.
> >
> > If you want a specific example, there's bound to be a way to simplify
> > editors/vi.c.  Or miscutils/less.c.
>
> Ohh, I *gladly* would take patches which simplify these.
> Or patches which fix them wrt Unicode. Or both.

My point was that finding this stuff is easy.  Dealing with it is the part that 
requires a lot of time and careful thought.

Making any change to busybox requires reading through the code that's there to 
gain a broad enough understanding of it that you're not making it worse.  And 
I can't do that anymore without coming across buckets of tangents that need 
doing, and I tend to lose track of my original goal.

Last time I seriously engaged with BusyBox it took over my life for a couple 
years.  Which meant the rest of my Linux work essentially rolled to a stop for 
a while.  Now I'm back getting a minimal native development environment to 
boot and run on over a dozen different hardware architectures that QEMU 
emulates, and getting existing architectures to _keep_ working is a heck of a 
Red Queen's race:

  http://kerneltrap.org/mailarchive/linux-kernel/2010/9/4/4615621/thread
  http://www.mail-archive.com/qemu-devel@nongnu.org/msg27071.html
  http://www.openfirmware.info/pipermail/openbios/2009-March/003601.html
  http://lkml.indiana.edu/hypermail/linux/kernel/0705.1/1962.html
  http://kerneltrap.org/mailarchive/linux-kernel/2010/2/22/4540565

And so on.  Not counting the perl removal patches I still haven't gotten 
upstreamed into the kernel, or the pending uClibc NPTL mess, or my supposed 
goal of bootstrapping Linux From Scratch, Gentoo, Fedora, and Ubuntu 
natively under the resulting system.

Or other things I _want_ to do like learn Lua and reimplement toybox in it, 
turn tinycc into qcc by ripping the back-end off and replacing it with QEMU's 
TCG, or test out all the new device tree stuff that's going into the kernel, 
or help out with the new llvm/clang work to come up with a viable 
replacement compiler for the political morass GCC has become, or do a "hello 
world" kernel for each target stripped down enough that with a bit of kexec 
magic you could use Linux as its own bootloader, or reinstall my laptop 
with Gentoo instead of an Ubuntu version so stale the update manager gave me a 
"no more updates, upgrade already" pop-up last week... 

(Or the whole "get a new day job" thing since my contract with Qualcomm ran 
out on the 31st and the department's new budget won't be approved before 
january at the earliest so they couldn't renew it.  But I'm used to having 
time between contracts, that's when I get the bulk of my open source 
programming done. :)

It's not that I don't want to work on busybox, it's that the scope of the 
problem is beyond the time commitment I can offer.  The project is pervasively 
messy, and continuing to get messier, to the point where just poking at it a 
couple days a month can't hope to keep up with the continuing influx of mess.

I keep bookmarking things like this:

  http://lists.busybox.net/pipermail/busybox/2010-October/073518.html

Which was a 30 second fix: all those #ifdefs could be if() statements.  I 
realize you don't see this as a problem, but I do.  Henry Spencer nicely sums 
up why here:

  http://doc.cat-v.org/henry_spencer/ifdef_considered_harmful

And Greg Kroah-Hartman covered it in his kernel coding style talk (this slide 
and the next two, and page 6 of the corresponding paper):

  http://www.kroah.com/linux/talks/ols_2002_kernel_codingstyle_talk/html/mgp00029.html
  http://www.kroah.com/linux/talks/ols_2002_kernel_codingstyle_paper/codingstyle.ps

But by the time I read that message on the mailing list you'd already applied 
it, and by the time I sat down to deal with the resulting code it had changed 
again to an even denser forest of #ifdefs, and if I have to argue about _why_ 
removing them is a good thing that takes even more time...
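
A minimal sketch of the kind of #ifdef-to-if() conversion meant above.  This is 
not the code from the linked patch: ENABLE_FEATURE_UPPERCASE and both greet_*() 
functions are made-up names for illustration, and the only assumption is that 
the config macro is always defined as 0 or 1 rather than left undefined.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical config switch: always defined, as either 0 or 1. */
#ifndef ENABLE_FEATURE_UPPERCASE
#define ENABLE_FEATURE_UPPERCASE 0
#endif

/* Preprocessor style: the disabled branch is never seen by the compiler,
 * so a syntax or type error in it goes unnoticed until someone flips
 * the switch. */
int greet_ifdef(char *buf, size_t len, const char *name)
{
#if ENABLE_FEATURE_UPPERCASE
	return snprintf(buf, len, "HELLO %s", name);
#else
	return snprintf(buf, len, "hello %s", name);
#endif
}

/* if() style: both branches are parsed and type-checked on every build.
 * The condition is a compile-time constant, so an optimizing compiler
 * can drop the dead branch. */
int greet_if(char *buf, size_t len, const char *name)
{
	if (ENABLE_FEATURE_UPPERCASE)
		return snprintf(buf, len, "HELLO %s", name);
	return snprintf(buf, len, "hello %s", name);
}
```

Both functions produce the same string for a given setting of the switch.  The 
difference is that the if() version keeps compiling the disabled code, so it 
can't silently rot.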

And when people ask where the mess needing cleanup is as if they can't see it, 
or act like #ifdef removal is black magic it takes special talent to do, or 
when you say you wonder how I came up with such a small sha1sum 
implementation... I find that really depressing.

I am not a very good coder.  By my standards, I suck at this.  I really do.  I 
just don't let sucking at it stop me from trying to figure out how to make it 
suck _less_.  The fact that I can't always manage doesn't make the goal any 
less worthwhile, and I would LOVE it if other people could do a better job at 
this so I didn't have to.

You don't see the code I throw away, or all the time I spend _thinking_ 
before coding.  One of the reasons I tend to have three or so open source 
projects ongoing at once (when I'm not just banging out some barely functional 
schlock to make a deadline and HOPE they throw it away afterwards) is that I 
get writer's block.  Not because I can't figure out how to make it work but 
because I can't figure out how to do it RIGHT.  Because I haven't yet convinced 
myself I've minimized the suck.  I haven't got the DESIGN right, which means 
I'm not thinking about the problem the right way yet, and it's far easier to 
tell I've got it wrong than to figure out what right is.

You'd think the BusyBox project would be the right place for computational 
Dorodango if anywhere was, especially five years after its 1.0 release where 
it's supposedly code complete and presumably implementing all of the Single 
Unix Specification's command line stuff it cares to.  You'd think the focus 
would switch to doing what it already does better.

But no, the focus is on adding more to the project.  New commands, new 
features, new complexity... *shrug*

> All these ideas seem like good ones to me.

Again, see "belling the cat".  Ideas are not the limiting factor.

> >> I wouldn't say 'nobody'.
> >
> > It is no longer the majority opinion.
>
> I actually look at code size VERY closely. See my other recent mail
> where I show that there is, on average, net reduction in size
> since 1.00 on the same config.

Great, but there's a pitfall here.  You know how pointy haired managers love 
quoting the phrase "You can't manage what you can't measure"?

  http://www.galorath.com/wp/you-can-manage-what-you-cant-measure.php

To which the rebuttal is the quote usually attributed to Einstein, "Not 
everything that counts can be counted, and not everything that can be counted 
counts":

  http://www.anecdote.com.au/archives/2006/09/if_you_cant_mea.html

The failure mode is "managing what you can measure".  Focus on what you can 
measure and consider it more important than what you can't.

BusyBox doesn't have a metric for simplicity, but it does have a metric for 
size (and to a lesser extent functionality: you can enumerate new features and 
even add regression tests for them).  Thus size and features are what you think 
about and what you constantly check, so they become more important over time 
and come to eclipse simplicity.

This gets you in big trouble when the metric you've got is only a proxy for 
the thing you really want, because people game the system.  When IBM focused 
on KLOCS (thousands of lines of code) as a programmer productivity metric, 
their employees cut and pasted their way to productivity bonuses and the 
actual code quality suffered tremendously until they stopped incentivizing that 
metric.

This isn't just a programming thing, it's the same failure mode which leads 
corporations to take away the free towels in the gym, because the cost of 
providing the service can be easily measured but the morale boost it gives the 
employees can't, therefore one is "real" and the other isn't.

Even when the metric is real and important, it tends to lead to the things you 
can't as easily measure getting ignored, because they're harder to think 
about.  You have to make an effort to see them.  It's an easy trap to fall into 
and a hard problem to solve, especially when the things you can measure are 
good things and important to get right, so spending time on them isn't 
necessarily _bad_...  They're just not the whole story.

That's part of the reason I personally valued simplicity _above_ the other 
two.  Precisely because it's harder to measure.

> I'm just doing it not in Rob's way ("rewrite this crap!"),
> but in "let's simplify this crap!" way.
>
> I invite Rob to rewrite any part he likes. Then I will
> try to simplify his rewrite. It's a win-win situation.

Been there, done that.

> >> > I've come to the conclusion I'm not helping here.
> >>
> >> From my point of view, you _are_ helping.  In your own way 8).
> >
> > No, I'm telling _myself_ to "shut up and show me the code".
> >
> > I just don't see it making a difference here with the amount of time I
> > have to put into it.  It's like trying to mop up a river, the new
> > arrivals bury any small gains I could make.
>
> New arrivals do help in one area: they reduce dependencies
> in LFS-type systems. You know it yourself since that's precisely
> the reason you use busybox in Aboriginal Linux:
> you want to have fewer packages.
>
> And when busybox acquires a new feature _you_ need, you see it
> as a win. (For example, 1.18.x will have brace expansion in hush).

I'm not saying more features is bad.  I'm saying the loss of simplicity is 
bad.  All three things trade off for each other, but one of them has taken it 
on the chin in favor of the other two.  It makes the project uncomfortable for 
me to work on, and I don't believe the amount of time/energy I have available 
to put in will even keep up with the ongoing degradation of the quality.

And being the only one who sees the imbalance as an actual _problem_ is 
discouraging enough that I'd rather just not watch, thanks.

> Which makes it more useful. Which is good.
>
> Why you fail to extrapolate your feeling of a win when
> *others* submit stuff, and it is accepted?

I love it when other people solve my problems.  But having to solve a second 
set of problems other people created while solving the first set of problems 
isn't always a net win.

I'm already running a Red Queen's race over on the kernel and qemu side, and 
LLVM will be another, and when I start caring about the unreleased uClibc code 
that will be another, and when I get distros natively bootstrapped the 
bootstrapping logic will bit rot too.

Mostly I want to push this stuff upstream.  If nothing else, automate the bug 
reporting and git bisect to the commit that broke test case X.  (That's half 
of what the cron job stuff is for.  Alas impactlinux.com went away this past 
week and it's all back on landley.net now which hasn't got the bandwidth for 
heavy use...  Yet another todo item I hadn't planned on...)

Cleaning up busybox is only an issue if I'm going to be developing on busybox.  
It's just my weird aesthetic sensibility that obviously isn't important to the 
metrics the project uses to measure itself, and I have enough self-imposed 
tasks for the moment, thanks.

> They are in exactly
> the same position as you: they don't want to cross-compile
> a $HUGE_BLOATED_PACKAGE, so they reimplement or port
> part of it to busybox.
>
> Same thing.

Actually in my case I want to keep the set of environmental dependencies as 
simple as possible.  The thing that really appeals to me about busybox is I 
can build it on a wide range of host systems without having to worry about 
whether that system has lex and bison and autoconf and automake and perl and 
python and zlib and internationalization support and...

It lets me worry about what I'm cross compiling _to_ and not have to worry 
about what I'm cross compiling _from_.  And that's valuable.

But it's the _simplicity_ that appeals to me, not the size or speed.  I'm 
building busybox "defconfig" because that makes my build scripts conceptually 
very simple, even though that literally enables a hundred more apps than my 
build actually needs.  Micro-managing busybox's config to strip it down would 
make it smaller, and make the _result_ simpler, but it would make my build 
more complicated, harder to maintain, harder for new people to learn, more 
likely to bit-rot...

I'm glad busybox does more things for more people, but I'd already implemented 
most of the feature set I personally needed to do this back in 1.2.2, and the 
remaining bits I've added since (oneit, patch, nbd-client, ccwrap, etc.) could 
live as individual files in sources/toys/*.c if necessary.  (You'll note I 
haven't pushed oneit into busybox, even though an init program designed to 
launch a single executable with a proper controlling TTY and signal handling, 
reap zombies until that executable exits, and then shut the system down... 
once upon a time that simplicity would have been in busybox's purview.  But it 
would have been a unique facility of busybox, not a copy of an existing 
program somebody else already wrote and maintains externally, and thus not 
part of busybox's _current_ mandate.  I never saw busybox as a shadow of other 
projects, but I was weird.  I went to a Weird Al concert last night where I 
bought the tour T-shirt I'm wearing now and shouted the civic motto "Keep 
Austin Weird" at him.  And the only two songs that were new to me were the 
"Polka Face" medley at the beginning and the one about cellphones.  That's how 
weird we're talking here.)

I'm happy that busybox is well maintained, and I'm happy that if I post a bug 
report here it tends to get addressed promptly.  But my goals and busybox's 
goals have drifted apart over the years, and I'd rather spend the majority of 
my time elsewhere now.

Rob
-- 
GPLv3: as worthy a successor as The Phantom Menace, as timely as Duke Nukem 
Forever, and as welcome as New Coke.