rewriting config_*, xmalloc_fget*

Wed Jun 15 13:11:05 UTC 2011

On Wed, Jun 15, 2011 at 04:04:52PM +0300, Timo Teräs wrote:
> Mmm.. yeah, uclibc needs fixing here. I'm still slightly doubtful how
> much the performance difference would be actually. It probably depends
> on the line lengths totally. On small lines, it's not much. On really
> long lines, it could be essential.

I would expect the difference to be the combined difference of
"getc_unlocked vs *p++" (small if getc_unlocked is a macro) and
"optimized-memchr vs naive-memchr-like-loop" (possibly 2-3x). My wild
guess would be 2x faster on short lines and 3.5x faster on long lines.
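To make the two scanning strategies concrete, here is a minimal sketch that finds the end of a line in an already-filled buffer, once with a byte-at-a-time loop and once with memchr. The function names line_len_naive and line_len_memchr are made up for illustration; this is not the code of any particular libc, and the speedup figures above remain a guess until measured.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive scan: one byte per iteration, comparable to what a
 * getc-style loop does once the data is already buffered. */
static size_t line_len_naive(const char *buf, size_t n)
{
    assert(buf != NULL);
    size_t i = 0;
    while (i < n && buf[i] != '\n')
        i++;
    /* Count the newline itself if one was found. */
    return i < n ? i + 1 : n;
}

/* memchr scan: hands the search to libc, which is free to use
 * a word-at-a-time or vectorized implementation. */
static size_t line_len_memchr(const char *buf, size_t n)
{
    assert(buf != NULL);
    const char *nl = memchr(buf, '\n', n);
    return nl ? (size_t)(nl - buf) + 1 : n;
}
```

Both functions return the same length for the same input; only the cost per byte differs, which is why the gap should widen as lines get longer.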

> Locking is not performed for pthread_mutex_* in this case.
> 
> However, glibc and uclibc with nptl, do locking for stdio stuff. The
> reason is that both support late loading of libpthread with dlopen. This
> is why both implementations use special locking primitives for stdio. In
> glibc case it's the lowlevellock stuff or lll_lock. It's done
> unconditionally, and always. And does at least one atomic instruction
> per lock acquire.

I don't see why they can't put "if (multithreaded)" before the lock
(where multithreaded is a global variable of some sort). That's
basically the approach I take in musl (actually I use the current
thread counter, which is possibly worse since it might change often,
but it allows apps that are only "temporarily multithreaded" to go
back to "max performance mode" after threads terminate).
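A minimal sketch of that idea, assuming a global thread counter that the thread-creation and thread-exit paths keep up to date. All names here (thread_count, sketch_set_threads, sketch_lock, sketch_unlock) are hypothetical, and this is not musl's actual code; the spin loop merely stands in for a real lock.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical counter of live threads; a real libc would update it
 * when threads are created and when they exit. */
static atomic_int thread_count = 1;
static atomic_flag lock_word = ATOMIC_FLAG_INIT;

/* Stand-in for the bookkeeping done on thread creation/exit. */
static void sketch_set_threads(int n)
{
    assert(n >= 1);
    atomic_store_explicit(&thread_count, n, memory_order_relaxed);
}

/* Returns 1 if the lock was actually taken, 0 if it was skipped. */
static int sketch_lock(void)
{
    /* Single-threaded: a plain load and a branch, no atomic
     * read-modify-write instruction at all. */
    if (atomic_load_explicit(&thread_count, memory_order_relaxed) <= 1)
        return 0;
    /* Multithreaded: spin on an atomic test-and-set. */
    while (atomic_flag_test_and_set_explicit(&lock_word, memory_order_acquire))
        ;
    return 1;
}

/* Release only what sketch_lock() really acquired. */
static void sketch_unlock(int locked)
{
    if (locked)
        atomic_flag_clear_explicit(&lock_word, memory_order_release);
}
```

Passing the return value of sketch_lock() to sketch_unlock() keeps the pair consistent even if the thread count changes between the two calls, which matters precisely because the counter "might change often".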

In any case, the difference is *very* small at least on x86 (including
SMP/multicore). The cost of the "lock xchg" vs "if (threads>1)" is
something like 50 cycles, and if I remember right, making my
implementation always perform locking only slowed it down by about
20%. (I say "only" because this translates into a very small difference
in total program time unless your program is "for (;;) free(malloc(1));".)

Perhaps the cost is much larger on other archs where atomics are more
expensive...

> For the getdelim stuff, it really depends on the input files which is
> faster, and probably needs benchmarking.

Ignoring the issue of uclibc's getdelim, I'm pretty sure getdelim will
always be faster, and usually by a significant margin. (Unlike malloc,
IO is where the program is spending a majority of its time, so making
IO 2-3x faster would make the whole process roughly 2-3x faster.)
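For reference, a minimal sketch of a getdelim-based read loop using the POSIX.1-2008 interface. The helper name count_lines is made up for illustration, and nothing here says anything about how a given libc implements getdelim internally.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Count the lines in a stream. getdelim() grows one buffer as needed
 * and reuses it on every call, so there is no per-line allocation. */
static size_t count_lines(FILE *f)
{
    assert(f != NULL);
    char *line = NULL;   /* allocated and resized by getdelim() */
    size_t cap = 0;
    size_t count = 0;

    /* getdelim() returns -1 at end of file or on error; a final line
     * without a trailing newline is still returned and counted. */
    while (getdelim(&line, &cap, '\n', f) != -1)
        count++;

    free(line);
    return count;
}
```

The caller only sees whole lines; how the delimiter is located inside the stdio buffer is left entirely to the library, which is where an optimized memchr can pay off.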

Rich
