gawk vs. BusyBox awk treatment of FS empty matches

Wolf wolf at wolfsden.cz
Sun Nov 1 20:22:08 UTC 2020


Just my $0.02 on this:

On 2020-11-01 20:02:24 +0000, David Čepelík wrote:
> I've noticed an interesting discrepancy between gawk and BusyBox awk:
> when the FS is set to e.g. ` *` (space asterisk), gawk will not consider
> empty matches of the regular expression (see e.g. [1]) while BusyBox
> will. This example demonstrates it:
> 
> ~% gawk --version
> GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.0)
> [...]
> ~% echo 'foo bar' | gawk -F' *' '{print $1}'
> foo
> 
> While BusyBox (a7c065354) will produce:
> 
> 1! ~/sw/3rd/busybox:master% echo 'foo bar' | ./busybox awk -F' *'  '{print $1}'
> f
> 
> Is this desired behavior? To the best of my knowledge, this isn't standardized.

I would argue that it is standardized, and that gawk is in violation of
the standard. The sentence I'm basing this claim on is (from [0]):

> Otherwise, the string value of FS shall be considered to be an
> extended regular expression. Each occurrence of a sequence matching
> the extended regular expression shall delimit fields.

The empty string is, in my opinion, a match of the ERE provided, so
BusyBox's behavior seems to be the correct one.
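
A quick way to sanity-check the claim that ` *` matches the empty string
is sketched below. It assumes a POSIX shell and a grep that supports -E
and -q; the helper name ere_matches_empty is made up for illustration
and is not part of any tool. It only tests the regular expression
itself and says nothing about how a given awk splits fields.

```shell
# Hypothetical helper: succeeds if the ERE given in $1 matches the
# empty string. A single empty line is fed to grep, and the pattern is
# anchored at both ends so only a match of the whole (empty) line counts.
ere_matches_empty() {
    printf '\n' | grep -Eq "^($1)\$"
}

# ' *' means "zero or more spaces", so it should match the empty string.
if ere_matches_empty ' *'; then
    echo "' *' matches the empty string"
fi

# ' +' means "one or more spaces", so it should not.
if ! ere_matches_empty ' +'; then
    echo "' +' does not match the empty string"
fi
```

By contrast, an FS such as ` +` cannot match the empty string, so the
two readings of the standard should not differ for it.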

> Would it make sense to harmonize BusyBox's implementation with GNU Awk?

That is a separate question. I do not know whether diverging from the
standard just to make this behave the same way as gawk is a good idea.
In my opinion, fixing gawk to comply is the correct choice (so consider
filing this as a bug report with them).



W.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.