[BusyBox] Newline doesn't match $

Mark Whitley markw at lineo.com
Tue Jun 13 23:21:04 UTC 2000


On Tue, Jun 13, 2000 at 10:55:51AM -0600, Erik Andersen wrote:
> On Tue Jun 13, 2000 at 11:24:04AM -0400, Pavel Roskin wrote:
> > Hello!
> > 
> > We have a problem with regular expressions - '$' doesn't match the
> > newline:

[snip]

> 
> Yeah.  This one has been listed as a known bug in the TODO file
> for quite a while now.  Mark Whitley has looked into what it will
> take to fix it, and he has pronounced it to be difficult.  If you
> have some time to try and fix it, that would be great!

Here's basically the (glaring) problems with the current implementation that
I've discovered:


1) The current grep/sed/regex stuff is using static buffers
-----------------------------------------------------------

Have a look at do_grep in grep.c, haystack[BUF_SIZ]. The static-buffer
approach has two problems: 

	- it makes the busybox executable larger and 
	- what happens when you try to perform regexes on a line larger than
	  BUF_SIZ?

The right approach, IMHO is to do something like this:


/* globals */
char *haystack = NULL;
int   haystacksz = 0;
const int HAYSTACK_GROWBY = 80;

...

/* (inside function that loads haystack with lines from file) */

{
	int infd; /* file descriptor of input file */
	char ch;  /* one character read from file */
	int idx = 0;

	infd = open(filename, O_RDONLY);
	if (infd < 0)
		return SOME_ERROR;

    while (read(infd, &ch, sizeof(ch))) {

        if (idx > haystacksz-1)
            haystack = realloc(haystack, haystacksz += HAYSTACK_GROWBY);

        haystack[idx++] = ch;

        if (ch == '\n') {
            haystack[idx] = 0;
            return haystack;
        }
    } 

	/* (rest of function body...) */
}


Or something to that effect.


2) The regexec() interface is a little twisted
----------------------------------------------

The current regexec() functions kinda look like they were written in a hurry
and had metacharacter support (^$.*? and friends) cobbled together as an
afterthought. The interface should be cleaned up to look something like this:

	int regexec(const struct regexp *re, const char *str, const int flags)

Note that I am deliberately avoiding adding support right now for
sub-expressions. The current implementation does not support sub-expressions
and adding such support would be kind of a pain. Best to leave it for later.

The existing regcomp() interface looks fine:

	struct regexp *regcomp(char *text) 

But we might want to add a regfree() to act as a destructor that can
complement regcomp, to wit:

	void regfree(struct regexp *freeme)

The rationale for adding a regfree is thus: if/when the 'struct regexp' data
type becomes more complex , possibly involving malloc'ed fields (e.g. when
dynamic buffers or sub-expressions are added), we want to be able to cleanly
destroy the whole ADT properly so as to avoid memory leaks.


3) Validation/execution of re's is missing/broken
-------------------------------------------------

The TODO file lists problems with 'grep *foo' (segfault), 'grep foo$' (doesn't
match properly) and others. These are symptoms of a deeper problem involving:

	- validation of impossible re's (*foo) is not being done properly and
	- execution of valid regexes is is not being done correctly (foo$)

The logic of the validation/execution functions just needs to be fixed and
tested; no magic bullet here.


Bottom Line
-----------

The sum of these problems calls for a rewrite of the existing regex functions,
and very probably a rewrite of grep, too; hopefully, sed can just be modified
a bit and not rewritten completely (we'll see). Now that I understand the
problem a little better, I think I can start to rewrite some of it. It's been
hard to find time to do this, but I think I can plow ahead now.


Disclaimers
-----------

In making observations about the shortcomings of the existing regex stuff, I'm
not trying to condemn or even slight anybody in any way, just calling the
plays like I see them, and trying to offer constructive criticizm rather than
the other kind of criticizm :-).

I'm pretty sure I can churn out something that will be better than the
existing implementation, but the only guarantee that I will make is that it
won't be perfect. Hopefully, it will be better, though. Making the transition
might be a bit tricky. When the time comes to drop the new stuff in place,
I'll ask for help testing. Thanks in advance.


Mark Whitley
markw at lineo.com





More information about the busybox mailing list