Toybox patch for busybox.

Rob Landley rob at landley.net
Mon Aug 16 18:45:29 UTC 2010


On Friday 13 August 2010 08:54:11 Denys Vlasenko wrote:
> On Thu, Aug 12, 2010 at 8:26 AM, Rob Landley <rob at landley.net> wrote:
> > 3) you may want these patches:
> >
> >
> > http://landley.net/hg/toybox/rev/362
> > http://landley.net/hg/toybox/rev/375
> >
> > And optionally some debug crap:
> > http://landley.net/hg/toybox/rev/376
>
> I just took the latest source from hg and put it into patch_toybox.c,
> then in scratch build renamed it to patch.c.

Woot.

> I need a few changes/fixes there before I can put it into bbox:
>
> (1) It fails one test which current bbox's one does not fail.
> With this patch:
>
> --- input.doesnt_exist        Jan 01 01:01:01 2000
> +++ input       Jan 01 01:01:01 2000
> @@ -1,2 +1,3 @@
>  qwe
> +asd
>  zxc
>
> "patch -R" tries to patch file named "input.doesnt_exist"
> whereas it should patch file "input".

Actually there's no standard about which one it should patch when the two 
filenames differ.  (I researched this when implementing patch.  This is why 
everybody does -p1, so that the old and new paths under the directory you diff 
can be identical and this issue gets glossed over.)

Of course gnu patch has heuristics to try to guess what you'd like to do.  (It 
doesn't change what you typed behind your back and pop up a talking paper 
clip, but you can tell it _wants_ to...)  And like fuzz factor, it can 
occasionally do the wrong thing and bite you.  But my research was long enough 
ago that I remember the annoyance but not the details...

Speaking of research, digging into my blog, the "Right, screw it.  Time to 
implement patch in toybox." entry (when I broke gnu patch) was December 11, 
2007.  And then the rest of the month is full of comments on patch.  I should 
dig them up and post them here for posterity:

  Implementing patch for toybox, I am learning _so_ much about the weird
  corner cases of diff and patch.  (This is a fairly common occurrence, I
  learned tons about mount implementing it, learned sed by implementing it,
  and so on.  For one thing, unified diffs can't have hunks out of order.  (If
  they do, the gnu version of patch fails with a specific error message for
  this.  If you reorder and then renumber them, the first one applies with an
  offset and the second fails.)  So it's doing a simple one pass
  substitution.

December 15, 2007:

  Got the first half of patch checked into toybox.  The patch output is
  going to stdout, but it seems to be applying hunks properly now.  With an
  offset even, albeit no fuzz factor yet.

  I need to think more about the offset logic.  Right now it doesn't care
  where the hunks say they are (in terms of starting line number), it just
  tries to apply them all in order based on context lines.  This is probably
  good enough for most real world uses, although the downside is that a single
  failed hunk means all the other hunks in the patch fail too.  In order to
  improve on it I'd probably want to read in all the hunks for a file and try
  to apply them in parallel.  Worry about it later...

December 17, 2007:

  Banging on patch.c.  Got it working... except for inserting new files,
  which breaks a couple of rules.  (Patching a file that doesn't exist,
  zero context lines.)  Deleting an entire file is another unimplemented corner
  case, again zero context lines and deleting the resulting file.  Also, right
  now I'm using the +++ line as the filename, which works for everything
  except comparing a file with /dev/null to delete it.  Sigh.

December 19, 2007:

  One more swing at patch (I keep getting distracted from it).  Got add
  and remove implemented, although not debugged yet.  I need to make it
  understand the git "move" syntax too, because renaming files is one of
  those things people actually do, and "delete whole file, insert whole file"
  is a _horrible_ way to represent that.

December 26, 2007:

  Implemented the rest of patch add/remove in toybox.  Comment from 60
  seconds ago, "That may actually have worked".  Firmware Linux seems to be
  building with it.  Fingers crossed...  I still need to implement
  "move" support, ala git.  And I need to go re-read Brahm Cohen's blog
  (the creator of bittorrent) to see what his new diff algorithm was all
  about.  But hey...  Progress.

December 28, 2007:

  And patch broke in the uClibc build, because if a hunk is right up against
  the end of the file there aren't as many trailing context lines as starting
  context lines.  (There may not, in fact, be any.)  So fewer ending than
  starting context lines means match end of file.  The start of file case is
  sort of already handled by the current code; with zero context lines we
  match immediatey (start of file unless it's not the first hunk, which is a
  "don't go there" case).  Trimming trailing context lines shouldn't hurt too
  much because the start of the file is easy to find reliably.

December 29, 2007:

  Yet another corner case in "patch".  I'd assumed there would always be
  at least one non-patch line between each patch, so what I was doing was
  naievely reading each hunk until the first line that didn't start with a
  plus minus or space char, and then trimming any extra context lines (more
  than the starting context line).

  This was _really_stupid_, and I honestly don't know what I was thinking.
  It was small, simple, and wrong.  Here's how that breaks when you cat
  multiple patch files together:

    --- filename
    +++ filename
    @@ -42,7 +42,7 @@
     context
     context
     context
    -line
    +line
     context
     context
     context
    --- filename
    +++ filename
    ...
  
  The length of the patch is given in the @@ line at the top: the -xx,7 means
  7 lines of old content, the +xx,7 means 7 lines of new content.  Use that
  and concatenating patches is no problem (and context line trimming isn't
  necessary either, so it actually probably winds up smaller.)

  My code's even already parsing @@ lines, although I just through it would
  just be for error reporting purposes.  The -42 and +42 are the starting line
  at which to expect the start of the hunk (in the old and new file,
  respectively), which is only really useful to tell you at what offset you did
  find the hunk.  I even thought about searching for multiple patch locations
  and taking the closest one, but traditional patch just locks to the first
  three matching lines of context and fails the hunk if the rest of the patch
  doesn't match.  (Unless of course you implement "fuzz factor" support, which
  is often <a href=http://kernelslacker.livejournal.com/7336.html>a bad
  idea</a>

  There is a corner case where repeated context lines could make it miss a
  patch location and fail to apply.  If the first three context lines are "aab"
  and the file has "aaab", then matching the first two and failing the third
  would flush those three already-evaluated lines from the previous supposed
  match to the output, and they wouldn't be re-scanned for another match later
  down.  This is an "I can fix that, but is it worth enlarging/complicating the
  algorithm"?

  I could make the thing _darn_ reliable (and fuzz factor implementation
  trivial) by loading the whole file into memory and running each hunk against
  the in-ram copy.  It could apply overlapping or out-of-order hunks, use the
  closest match in case of duplicate match sites...  But it would also eat
  much more memory at runtime, and I'm trying to do embedded software...

  Maybe I could do something with mmap(MAP_PRIVATE)?  Eh, worry about it
  later, get this implementation working first.  And _that_ means a rewrite of
  the hunk reading code.

That's it through 2007, I should dig up the 2008, 2009, and 2010 entries and 
post 'em too, but I'll worry about that later, this is already long enough...

> (2) printf("patching %s\n", name) should be changed into
> printf("patching file %s\n", name) to mimic standard patch.

If you'd like.

Most of my work on this predated SUSv4, which _finally_ mentions -u:

  http://www.opengroup.org/onlinepubs/9699919799/utilities/patch.html

It might be nice to compare what the standard finally got around to saying with 
what reality's been doing for the past 20 years, if only for the amusement 
value.  (for example, the new diff standard insists that the timestamp always 
be output, when in reality it's optional.)


> (3) it does not support -N. bbox one does.

It's got a longish todo list of things that could be added.

> I fixed (1) and (2) and committed the result to git. So you can pull it
> and hack on it.

I'm not sure "fixed" is the right word for (1), I suspect you just grabbed one 
of the many behaviors the gnu heuristic could squeeze out, which may or may 
not have been an improvement.

If you want to follow the gnu-gnu-gnu-dammit project's heuristic, then try the 
+++ name first and if it's not there try the --- name.  Except for reversed 
patches, where you try them in the opposite order.  And of course when you're 
creating files said file not existing is actually your _success_ condition.  
(Deletion works like modification, you find the file, then you operate on it.  
Creation you have to guess right first try.)

I believe the logic is that the "+++" name is the one you're trying to turn 
the thing _into_, so it should be the first choice of name you have when you 
finish.  But when you reverse the patch, you go the other way.  Except there's 
a corner case where you're reversing a patch that deleted a file, and the goal 
isn't to figure out what the patch _says_ to do but to reverse what the 
creation operation would have done.

Most of the time both of the listed names aren't there, so you just pick 
whichever one exists.  But this opens the can of worms of getting consistent 
behavior when they _are_ both there, which turns out to be nontrivial.

I just punted and went with something simple and consistent, I think it was 
"I'm always using the +++ name for normal patching and --- for reverse, and if 
that's not good enough use -p1 already".

Hey, it turns out that Posix 2008 has an opinion on this:

  http://www.opengroup.org/onlinepubs/9699919799/utilities/patch.html#tag_20_90_13_02

And their filename determination section doesn't even mention the "reverse" 
option.  (Standards committees that don't quite understand the utility they're 
documenting, where have I heard that before...)

I wouldn't feel bound by their heuristic (for one thing, we're not prompting 
for filename, that never did anybody any good... and SCCS?  In 2008?  Really, 
POSIX committee?  You're _serious_?)

And of _course_ they don't document the git file delete/rename semantics.  That 
would be too obvious.  It looks like this, by the way:

  http://www.spinics.net/lists/linux-doc/msg00095.html

And is documented here:

  http://www.kernel.org/pub/software/scm/git/docs/git-diff.html

See "Generating patches with -p".

> I didn't fully "busyboxify" it, need to restore -N first (and add a test
> for it, we don't have one).

As I said, there's a todo list.  (In a comment, I think.)

It needs a config entry for the "extra" behavior, and a determiniation about 
what the minimal behavior _is_.  (Pretty much the stuff I implemented was what 
I actually needed to patch packages in the wild.  And that was with -p1, since 
that's what people who know what they're doing post to lists, and what git and 
hg produce...)

> > I'm also working on a patch to implement -l support (squash whitespace)
> > if you're interested.  And at some point I should do fuzz factor support,
> > and better error reporting.  And add a CONFIG symbol to chop out most of
> > the command line options...
>
> I will try to not get in the way (in patch.c) for now then.

Oh go ahead and have fun, I'll migrate my stuff when you get to a good stopping 
point. :)

Thanks for doing this, by the way.  I'm proud of the work I did in toybox, but 
it proved what I set out to, and that effort would be better spent cleaning up 
busybox...

Rob
-- 
GPLv3: as worthy a successor as The Phantom Menace, as timely as Duke Nukem 
Forever, and as welcome as New Coke.




More information about the busybox mailing list