rewriting config_*, xmalloc_fget*

Timo Teräs timo.teras at iki.fi
Wed Jun 15 13:47:39 UTC 2011


On 06/15/2011 04:11 PM, Rich Felker wrote:
> On Wed, Jun 15, 2011 at 04:04:52PM +0300, Timo Teräs wrote:
>> Mmm.. yeah, uclibc needs fixing here. I'm still slightly doubtful how
>> much the performance difference would be actually. It probably depends
>> on the line lengths totally. On small lines, it's not much. On really
>> long lines, it could be essential.
> 
> I would expect the difference to be the combined difference of
> "getc_unlocked vs *p++" (small if getc_unlocked is a macro) and
> "optimized-memchr vs naive-memchr-like-loop" (possibly 2-3x). My wild
> guess would be 2x faster on short lines and 3.5x faster on long lines.
> 
>> Locking is not performed for pthread_mutex_* in this case.
>>
>> However, glibc and uclibc with nptl, do locking for stdio stuff. The
>> reason is that both support late loading of libpthread with dlopen. This
>> is why both implementations use special locking primitives for stdio. In
>> glibc case it's the lowlevellock stuff or lll_lock. It's done
>> unconditionally, and always. And does at least one atomic instruction
>> per lock acquire.
> 
> I don't see why they can't put "if (multithreaded)" before the lock
> (where multithreaded is a global variable of some sort). That's
> basically the approach I take in musl (actually I use the current
> thread counter, which is possibly worse since it might change often,
> but it allows apps that are only "temporarily multithreaded" to go
> back to "max performance mode" after threads terminate).

They probably could. They already have other hacks, like jumping over
the LOCK prefix byte when not running on an SMP system. Though, there
might be limitations imposed by (but not limited to) how the nptl and
libc separation is done (these are two separate, but deeply intertwined
libraries), how forking and/or signals are handled when stdio locks are
held, or how libpthread can be late dlopened into the process (which
already includes lots of hacks to make the pthread_mutex_* avoidance
work).

However, my point was just that the two most used libraries do incur
the locking overhead... even if you got it right in musl.
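
To make the "if (multithreaded)" idea concrete, here is a minimal sketch
of a lock that skips the atomic instruction while only one thread
exists. All names (thread_count, stdio_lock, cond_lock, cond_unlock,
atomic_ops) are made up for illustration and are not taken from musl,
glibc or uclibc; a real libc would also have to update the counter
safely on thread creation and exit, and would sleep instead of spinning.

```c
#include <stdatomic.h>

/* Hypothetical names; not the code of any real libc. */
static int thread_count = 1;              /* assumed to be updated on thread create/exit */
static atomic_flag stdio_lock = ATOMIC_FLAG_INIT;
static unsigned long atomic_ops;          /* only counts how often the atomic path ran */

/* Take the lock only when more than one thread exists.
 * Returns 1 if the lock was taken and must be released, 0 otherwise. */
static int cond_lock(void)
{
    if (thread_count <= 1)
        return 0;                         /* single-threaded: plain branch, no atomic */
    while (atomic_flag_test_and_set_explicit(&stdio_lock, memory_order_acquire))
        ;                                 /* spin; a real implementation would block */
    atomic_ops++;                         /* safe: we hold the lock here */
    return 1;
}

/* Release the lock only if cond_lock() actually took it. */
static void cond_unlock(int locked)
{
    if (locked)
        atomic_flag_clear_explicit(&stdio_lock, memory_order_release);
}
```

Returning the "locked" state from cond_lock() keeps the pair consistent
even if the thread count changes between lock and unlock.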

> In any case, the difference is *very* small at least on x86 (including
> SMP/multicore). The cost of the "lock xchg" vs "if (threads>1)" is
> something like 50 cycles, and if I remember right, making my
> implementation always perform locking only slowed it down by about
> 20%. (I say only because this translates into very small difference in
> total program time unless your program is "for (;;) free(malloc(1));")
> 
> Perhaps the cost is much larger on other archs where atomics are more
> expensive...

The figures depend on the specific application.

But still, why do malloc() ... free() in a relatively tight inner loop
if there's a clean way to avoid it? I don't see the point of doing
"fast things" when we don't have to do them at all. When reading a file
of 100,000 or a million lines, avoiding that many malloc/free calls (or
more) is visible in the execution time.
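
As a rough sketch of what avoiding the per-line malloc/free could look
like, the loop below reuses a single buffer through POSIX getline(). The
helper name count_lines is hypothetical and this is not the current
busybox code; it only shows the allocation pattern.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: count lines while reusing one getline() buffer. */
static long count_lines(FILE *fp)
{
    char *line = NULL;   /* getline() allocates on first use ...            */
    size_t cap = 0;      /* ... and reallocs only when a longer line appears */
    long n = 0;

    while (getline(&line, &cap, fp) != -1)
        n++;             /* a real caller would parse 'line' here */

    free(line);          /* a single free for the whole file */
    return n;
}
```

For a file with a million lines this does a handful of reallocations
instead of a million malloc/free pairs.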

>> For the getdelim stuff, it really depends on the input files which is
>> faster, and probably needs benchmarking.
> 
> Ignoring the issue of uclibc's getdelim, I'm pretty sure getdelim will
> always be faster, and usually by a significant margin. (Unlike malloc,
> IO is where the program is spending a majority of its time, so making
> IO 2-3x faster would make the whole process roughly 2-3x faster.)

Well, it'd be "locking overhead + function call overhead" vs. "inner
loop getc doing buffer length checks". And yes, getdelim will likely win
even at fairly short line lengths. The longer the line, the bigger the
win.

Though, in the bb config_* API, getdelim isn't always enough. Some
callers seem to need a special line reader that treats both \0 and \n
as line terminators.

This is yet another reason why the BB config_* API needs a rewrite.
Those places should use a different function for that (one that can do
the slow getc stuff), and we could optimise the more common line reading
(with only \0 or only \n as terminator) to use getdelim.
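
A minimal sketch of that split, assuming two hypothetical functions
(read_line_delim and read_line_nl_or_nul are names invented here, not
part of the existing config_* API): the common case hands a single
terminator to getdelim, and only the special case pays for the getc loop
with its buffer length checks.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical names; this is not the existing BusyBox config_* API. */

/* Common case: exactly one terminator ('\n' or '\0'); libc scans its buffer. */
static ssize_t read_line_delim(char **buf, size_t *cap, FILE *fp, int delim)
{
    return getdelim(buf, cap, delim, fp);
}

/* Special case: stop at '\n' or '\0', whichever comes first (slow getc path).
 * Returns the number of bytes stored, terminator included, or -1 on EOF
 * with no data or on allocation failure. */
static ssize_t read_line_nl_or_nul(char **buf, size_t *cap, FILE *fp)
{
    size_t len = 0;
    int c;

    while ((c = getc(fp)) != EOF) {
        if (len + 2 > *cap) {                 /* room for c plus trailing NUL */
            size_t ncap = *cap ? *cap * 2 : 64;
            char *nbuf = realloc(*buf, ncap);
            if (!nbuf)
                return -1;
            *buf = nbuf;
            *cap = ncap;
        }
        (*buf)[len++] = (char)c;
        if (c == '\n' || c == '\0')
            break;
    }
    if (len == 0)
        return -1;
    (*buf)[len] = '\0';
    return (ssize_t)len;
}
```

Both functions keep the getdelim-style calling convention (caller-owned
buffer that grows as needed), so callers could switch between them
without changing how they manage memory.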

- Timo

