Fork vs vfork. (Was Re: tar bugfix patch n.347 (recoded, version 4))

Rob Landley rob at
Fri Dec 23 00:01:31 UTC 2005

On Thursday 22 December 2005 06:05, Roberto A. Foglietta wrote:
> > +       if (nwrote == -1) {
> > +               bb_perror_msg("unzip write failed");
> > +               return EXIT_FAILURE; // any return or die will leak memory in bb as static link applets
> > +       }
> >
> > We have a bb_perror_msg_and_die() that never returns, and is the same
> > size as bb_perror_msg() so you save the return.
>   I return EXIT_FAILURE because I see some applets need a FAIL return
> code to delete tmp-files which did not inflate correctly, in particular
> when disk space is not enough to complete writing. That is what I
> understood of the use of this code. If I am wrong I spent only one
> return too many before the end. I think optimization is important in
> loops, not so much in exiting.

Whether you're in a loop or not matters for speed, but not for size.  Busybox 
is optimized for size, and "four bytes here, four bytes there" adds up.

What we'd ordinarily do for this kind of cleanup is register an atexit() 
function and a global variable, and have something like:

char *tempfile = NULL;

void delete_out(void)
{
	if (tempfile) unlink(tempfile);
}

int main(...)
{
	atexit(delete_out);
	...
Possibly something like this should be in libbb, because sed -i cares about 
that and who knows what else?

> > Why are you referencing a SunOS 5 man page?  Do you want to port busybox
> > to Sun?  (Does sun only work if you use vfork() instead of fork?)
>   I removed the memory leak comment because I think you are right, but I
> left the frees before returning because I do not think they hurt or slow
> things down too much, because:
>   if bb is not exiting, the frees are required
>   if bb is exiting, delaying it by two frees does not hurt too much

Busybox is not optimized for speed.  Busybox is optimized for _size_ first, 
then we try to minimize complexity, and only after that do we worry about 
speed.  (We're implementing a set of command line utilities that were at 
least tolerably fast on a 486/33 DX, and the environment we're running in is 
not only way the heck faster, but our optimization profile is entirely 
different: the dominant performance killer these days is page faults.  
Wandering out of L2 cache to DRAM is way more expensive than a 
cached function call.)

We tend to guard "optional" frees (ones that aren't in a loop over a dataset 
of unknown size) with if (ENABLE_FEATURE_CLEAN_UP) and let the compiler 
optimize them away when that's #defined to 0.

>   In the case frees make segmentation fault that probably hurts much
> more in non-exiting case.

I have no idea what you just said.

At a wild guess: note that free(NULL) is in the standard as ok to do.

> Memory leaks, in my opinion, could happen when a 
> bb generates zombies, even if in this case they are not *really* memory leaks.

Zombies have nothing to do with memory leaks.  Zombies are processes that 
exited with a damaged parent (one that neither set SIGCHLD to ignore nor did a 
wait upon receiving it), so they wait indefinitely to deliver their exit status.

>   Zombies could happen and they waste system memory; if I make the frees
> before exit I should waste much less memory because zombies are as
> small as possible.

If zombies happen you have a bigger problem.  Don't try to strap foam rubber 
to the front of the car just in case the steering wheel breaks off.

>   About vfork/fork and _exit/exit the only thing I can say for sure is
> my great ignorance. So actually the only things I know are from here:

Darn it, I'm trying to think of a good programming basics reference and not 
coming up with anything.  I started programming when I was 12 on a commodore 
64, started doing C under dos in 1990, learned most of the Unix API via the 
EMX unix emulation package under OS/2, and by the time I bit the bullet and 
spent the several weeks of effort necessary to make Linux my primary 
desktop (XFree86 hated every video card I ever owned through the end of 
1998), I was doing most of my programming in Java...

Ok, vfork() isn't really fork.  In theory, vfork suspends the parent process 
and shares all the memory with the child until the child does exec, at which 
point the parent wakes up and resumes from the vfork().  In reality, the 
parent sort of does a setjmp and then continues on pretending to be the child 
until the exec() comes around, then the _exec_ does the actual fork, and the 
parent does a longjmp back to the original vfork call and continues on from 
there.  (It thus becomes obvious why the child can't return from the function 
that did vfork before the exec() call.  When the parent later tries to return 
after it resumes, it's using a section of stack that's been stomped on by the 
child, and hilarity ensues.)

The reason vfork exists is that if you haven't got an MMU then you can't 
simply set up a second set of page tables and share the physical memory via 
copy-on-write, which is what fork normally does.  This means that actually 
forking has to copy all the parent's memory (which could easily be tens of 
megabytes).  And you have to do this even though that memory gets freed again 
as soon as the exec happens, so it's probably all a big waste of time.

This is not only slow and a waste of space, it also causes totally unnecessary 
memory usage spikes based on how big the _parent_ process is (not the child), 
and these spikes are quite likely to trigger an out of memory condition on 
small systems (which is where nommu is common anyway).  So although you _can_ 
emulate a real fork on a nommu system, you really don't want to.  With nommu 
(specifically no ability to mark pages copy-on-write), vfork is a much better 
alternative.  Anywhere else, it's just a gross hack to fake fork via longjmp 
and let exec do the actual fork later.

Note a common mistake: the need for vfork doesn't mean you can't have two 
processes running at the same time.  It means you can't have two processes 
sharing the same memory without stomping all over each other.  (That's why 
the parent has to go to sleep until the child does its exec.  They're sharing 
the same stack, so if the parent calls a function and the child calls a 
function at the same time, they'll stomp on each other's callstack.  Also, 
any local variables the child modifies will retain the modified values when 
the parent resumes.  It's really easy to shoot yourself in the foot here.)

Now in theory, a nommu system could just copy the _stack_ when it forks (which 
presumably is much shorter than the heap), and leave the heap shared.  In 
practice, you've just wound up in a multi-threaded situation and you can't do 
a malloc() or free() on your heap without freeing the other process's memory 
(and if you don't have the proper locking for being threaded, corrupting the 
heap if both of you try to do it at the same time and wind up stomping on 
each other while traversing the free memory lists).  The thing about vfork is 
that it's a big red flag warning "there be dragons here" rather than 
something subtle and thus even more dangerous.

For busybox purposes, what we probably want is something in libbb that looks 
like "int fork_exec(char *command, ...)".  This would combine both the fork 
and the exec in a single function, so whether or not it was implemented via 
vfork or via fork would be irrelevant to the rest of the system.  Thus we 
could happily implement it via vfork, make sure there were no other vfork() 
calls anywhere else in busybox, and then say "ok, if you need fork-and-exec 
then use this, and only call fork if you need a real fork" which should 
almost never happen in busybox.

And then hopefully we never need to worry about vfork again...

By the way, _exit() is just a wrapper for the system call.  It doesn't do 
cleanup (like running the atexit list) but simply tells the OS "kill this 
process now", and leaves cleanup to the OS.  We do _not_ want to sprinkle 
calls to the sucker around.  Thinking it'll somehow help with zombies or 
memory leaks says to me that you really don't understand what's going on.

>   which documents also talk about zombies and how to prevent them
> happening. The Solaris docs were read just because I worked on Solaris hw

My condolences.

> but actually I do not have any such hw.

Busybox is Linux.  We've been ported to other platforms, but our primary 
deployment platform is Linux, it's the platform our semantics have to make 
sense on, and anything else we're ported to will be relative to Linux.

Of course within Linux itself we've got several different processors with 
different endiannesses, different word sizes, alignment constraints (some 
platforms actually care about unaligned access, which is sad but true), plus 
the whole nommu issue.

There are also a few version differences between Linux of today (what we're 
writing to) and Linux of five years ago (about as far back as we really care 
to support in new versions.  Maybe 7 years on a clear day with a tailwind, if 
the bug submitter is polite.  For anything older, they have the source code 
and if they're going to upgrade busybox they should upgrade the dependencies 
of busybox too.  Nobody's forcing them to upgrade.)  But differences between 
Linux versions aren't nearly as big of an issue as x86 vs arm.

Linux itself is quite capable of keeping our diversity budget occupied.  We 
generally only poke at supporting other platforms when it's easy and when the 
corresponding busybox applets on Linux are already quite stable and working 
well.  (And when somebody else steps up and volunteers to do the work. :)

> > Your makefile changes are unrelated to this patch, and the "rmdir
> > docs/" is actually _dangerous_ because the website
> > is copied from the version in svn once an hour.  So the version in the
> > tarball is in fact supposed to be there, in svn at least.  (We can remove
> > it from actual releases but why?)
>   You are right in that makefile changes should not appear in that patch.
>   Anyway, I changed the makefile because diff produced output in which:

Don't include stuff you don't intend me to merge in the patch, please.  (This 
is called sending in a clean patch.)

>   - *.c~ was found
>   - docs/ empty dir was found
>   Please note that rmdir without any options does not remove directories
> containing files/subdirs but only empty dirs (at least on my
> workstation; an alias may differ). So I do not think it could hurt; it
> simply removes the diff issue.

I still ain't merging it.

> > It was tested!  That's what I like to see. :)
> >
> > Looking in the code, the redundant error message is a special case you
> > put in, right before an already existing return(EXIT_FAILURE).  If you've
> > already gotten an "unexpected end of file" message, why print another? 
> > (Is that first message _only_ printed for compressed archives, and you
> > need the zero length one for uncompressed tarballs...?)
>   Because the second message happens in the parent process and causes it
> to die with exit code 1; otherwise it would exit with 0 even if the child
> failed.

So the child is identifying itself as "busybox" in the error message, and the 
parent is also identifying itself as "busybox" in its error message.

I have code somewhere around here to make tar just fork/exec a copy of gzip or 
bzip, via $PATH.  (Somebody posted a patch that was applied to 1.01 but 
needed cleanup for 1.1.)  I should dig that up...

Sigh.  It's timing.  I keep seeing "in order to fix this properly, I should 
make this infrastructure change" and I'm _trying_ to get this all closed down 
for release.

Hmmm...  Sounds like what I need to do is cut a -pre2 and defer the bugs that 
are "too much work to fix before -pre2, but need to be fixed before 1.1.0".  
(Like the darn remount issue I'm wrestling with in the portion of my copious 
free time that the holidays and visiting friends haven't consumed)...


> The first message happens in the child, but open_transformer did 
> not have the capability to catch the child's exit code. The patch
> attached to this mail makes open_transformer able to catch the child's
> exit code. I read that the code should prevent zombies happening too.

If you ask me, open_transformer is kind of evil, really.  Fork and feed data 
through a pipe.  (I mentioned having a patch for this somewhere.)

>   Even with open_transformer changed in this way, I think the zero
> length msg is good because it happens in *that* process, or happens for
> zero length file/data, or does not happen at all if the child management
> was done. So I did not remove that message.

A zero length file is not special.  For normal gunzip:

  echo -n | gunzip
  gunzip: stdin: unexpected end of file

And you think busybox (trying to be a _smaller_ implementation than normal) 
should have a separate string printing a separate message?

>   These outputs are related to the patch attached:
> [roberto at wsraf busybox-1.01_raf]$ ./busybox tar tvzf /tmp/test.tar.gz
> -rw-r--r-- 0/0    133490 2005-11-18 10:20:23 ../volantino_LD05.pdf
> [roberto at wsraf busybox-1.01_raf]$ ./busybox tar tvzf /tmp/test.tar.gz0
> tar: unexpected end of file  <-- happen in the child
> tar: child exit with error code 1 <-- reported in the parent
>   The second message could be removed but I do not like a parent which
> dies on an error without saying anything, because nobody could tell me if
> that child had a verbose or silent death.

I can see that.  The problem there is that the child is identifying itself as 
tar when in fact the error is from gunzip.

>   Sorry, I do not understand this; probably my English is not good
> enough. If you were saying you would like to see another patch...
>   ...look at the tail of this msg.

And it's on my todo pile.

You're right that we need more documentation.  Some kind of programming FAQ.  
I've started one and am working on it.  Hope to check it in soon, but I'm at 
some friends' house and shouldn't ignore them so much. :)

Steve Ballmer: Innovation!  Inigo Montoya: You keep using that word.
I do not think it means what you think it means.
