httpd-dev mailing list archives

From Justin Erenkrantz <>
Subject Re: remaining CPU bottlenecks in 2.0
Date Wed, 05 Sep 2001 04:46:58 GMT
On Tue, Sep 04, 2001 at 08:00:35PM -0700, Brian Pane wrote:
> I'm currently studying profiling data from an httpd built from
> a CVS snapshot earlier today.
> In general, the performance of 2.0 is starting to look good.

Cool.  This probably means the code is starting to look good, too.

> * The discussion here covers only CPU utilization.  There are other
>   aspects of performance, like multiprocessor scalability, that
>   are independent of this data.

Once we get the syscalls optimized (I'm reminded of Dean's attack
on our number of syscalls in 1.3 - I believe he went through syscall
by syscall, trying to eliminate all of the unnecessary ones), I think
the next performance point will be MP scalability (see below for
lock scalability on Solaris).  But we do need to see what we can
do about optimizing the syscalls first...

> * find_start_sequence() is the main scanning function within
>   mod_include.  There's some research in progress to try to speed
>   this up significantly.

Based on the patches you submitted (and my quasi-errant formatting
patch), I had to read most of the code in mod_include, so I'm more
familiar with it now.  I do think there are some obvious ways to
optimize find_start_sequence.  I wonder if we could apply a KMP
string-matching algorithm here.  I dunno.  I'll take a look at it,
though.  Something bugs me about the restarts.  I bet we spend even
more time in find_start_sequence when an HTML file has lots of
comments.  =-)
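For what it's worth, here's a rough sketch (plain C, all names mine,
not mod_include's) of what a KMP scan for the SSI start tag might
look like.  The win is that after a partial match it never rewinds
the buffer pointer - which is exactly the restart case that bugs me:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: KMP search for a short fixed tag such as
 * "<!--#".  Returns the offset of the first match in buf, or -1.
 * Unlike a naive scan, it never re-examines a buffer byte, so a
 * near-miss (e.g. a plain HTML comment) costs no restart. */
long kmp_find(const char *buf, size_t buflen,
              const char *pat, size_t patlen)
{
    size_t fail[64];           /* failure table; assumes patlen < 64 */
    size_t i, k;

    if (patlen == 0 || patlen >= 64)
        return -1;

    /* fail[i] = length of the longest proper prefix of pat[0..i]
     * that is also a suffix of it. */
    fail[0] = 0;
    for (i = 1, k = 0; i < patlen; i++) {
        while (k > 0 && pat[i] != pat[k])
            k = fail[k - 1];
        if (pat[i] == pat[k])
            k++;
        fail[i] = k;
    }

    /* Scan: on mismatch, fall back via the table instead of
     * rewinding the buffer pointer. */
    for (i = 0, k = 0; i < buflen; i++) {
        while (k > 0 && buf[i] != pat[k])
            k = fail[k - 1];
        if (buf[i] == pat[k])
            k++;
        if (k == patlen)
            return (long)(i - patlen + 1);
    }
    return -1;
}
```

Whether the bookkeeping beats a tuned memchr-style scan for a
five-byte pattern is exactly what would need measuring.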

> * strlen() is called from lots of places throughout the code, with
>   the most frequent calls being from apr_pstrdup, apr_pstrcat, and
>   time-formatting functions used in apr_rfc822_date.

I think someone has brought up that apr_pstrdup does an extra strlen.
I'll have to review that code.
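To illustrate the concern (malloc standing in for the pool
allocator; these names are mine, not APR's): a strdup-style copy
necessarily scans for the length once, but a caller that already
knows the length could use a memdup-style variant and skip the
scan entirely:

```c
#include <stdlib.h>
#include <string.h>

/* strdup-style copy: one unavoidable length scan, then one copy.
 * The bug to watch for is a *second* strlen hiding in the copy. */
char *pool_strdup(const char *s)
{
    size_t len = strlen(s);          /* the one length scan */
    char *copy = malloc(len + 1);
    if (copy)
        memcpy(copy, s, len + 1);    /* copies bytes + NUL, no rescan */
    return copy;
}

/* When the caller already knows the length, skip strlen entirely,
 * in the style of a pstrmemdup/pstrndup interface. */
char *pool_strmemdup(const char *s, size_t len)
{
    char *copy = malloc(len + 1);
    if (copy) {
        memcpy(copy, s, len);
        copy[len] = '\0';
    }
    return copy;
}
```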

> * _lwp_mutex_unlock() gets called from pthread_mutex_unlock(),
>   but only from a small fraction of pthread_mutex_unlock calls
>   (Can someone familiar with Solaris threading internals explain
>   this one?)

The LWP scheduler may also call _lwp_mutex_unlock() implicitly -
the LWP scheduler is a user-space library, so I bet it gets thrown
in with our numbers.

Here's some background on Solaris's implementation that I
think may provide some useful information as to how the locks 
will perform overall.  (If you spot any inconsistencies, it is
probably my fault...I'm going to try to explain this as best as
I can...)

First off, Solaris has adaptive locks.  If the owner of the lock
is currently active (on a CPU), a waiter will spin.  If the system
sees that the owner of the held lock is not currently active, the
waiter will sleep (they call this an adaptive lock - it now enters
a turnstile).

Okay, so what happens when a mutex unlocks?  This depends on
whether you are in a spin or an adaptive lock.  Spinning waiters
immediately see the freed lock, and the first one on a CPU grabs
it (ignoring priority inversion here).
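The acquire-side decision boils down to something like this (a toy
sketch, names invented - not actual Solaris code):

```c
#include <stdbool.h>

typedef enum { ACQ_SPIN, ACQ_SLEEP_TURNSTILE } acquire_mode;

/* Adaptive choice: spin only if the current holder is on a CPU,
 * since it will likely release soon; if the holder has been
 * descheduled, spinning just burns cycles, so park the thread on
 * a turnstile instead. */
acquire_mode adaptive_acquire(bool owner_running_on_cpu)
{
    return owner_running_on_cpu ? ACQ_SPIN : ACQ_SLEEP_TURNSTILE;
}
```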

But the more interesting issue is what happens when we are in
an adaptive lock.  According to Mauro, Solaris 7+ has a thundering
herd condition for adaptive kernel locks.  It wakes up *all*
waiters and lets them fight it out.  This is a change from
Solaris 2.5.1 and 2.6.

Since you should never have a lot of threads sitting in a mutex
(according to Sun, in practice it is typical to have only one
kernel thread waiting), this thundering herd is okay and actually
performs better than freeing the lock and waking only one waiter.
They say it also makes the code much cleaner.  I'm betting we're
overusing the mutexes, which changes the equation considerably.  =-)
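As a user-space analogy (not the kernel code itself),
pthread_cond_broadcast shows the same wake-all pattern: every
waiter is awakened at once and they fight it out for the mutex,
reacquiring it one at a time:

```c
#include <pthread.h>

/* Wake-all demo: NWAITERS threads block on a condition variable;
 * a single broadcast wakes them all, and each must then win the
 * mutex in turn - the "fight it out" part of the thundering herd. */

#define NWAITERS 4

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int ready = 0;
static int woken = 0;

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mtx);
    while (!ready)                   /* guard against spurious wakeups */
        pthread_cond_wait(&cv, &mtx);
    woken++;                         /* runs with mtx held, one at a time */
    pthread_mutex_unlock(&mtx);
    return NULL;
}

int run_broadcast_demo(void)
{
    pthread_t t[NWAITERS];
    int i;

    for (i = 0; i < NWAITERS; i++)
        pthread_create(&t[i], NULL, waiter, NULL);

    pthread_mutex_lock(&mtx);
    ready = 1;
    pthread_cond_broadcast(&cv);     /* wake *all* waiters at once */
    pthread_mutex_unlock(&mtx);

    for (i = 0; i < NWAITERS; i++)
        pthread_join(t[i], NULL);
    return woken;                    /* every waiter ran exactly once */
}
```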

Okay, that tells you how it is done in kernel-space.  For
user-space (see below for how to remove user-space threading),
it is slightly different.  Remember, in Solaris we have a
two-tier thread model - user-space threads and kernel threads.

In user-space, when we call pthread_mutex_unlock, we hit 
_mutex_unlock in liblwp, which calls mutex_unlock_adaptive 
as we aren't a special case lock.

So, what does mutex_unlock_adaptive do in lwp?  In pseudocode (as
best as I can explain it): it first checks whether there is a
waiter in the current LWP; if so, it clears the lock bit and the
other thread in the same LWP takes the lock.  If no thread in the
same LWP is waiting, it will then "sleep" for ~500 while-loop
iterations to let another thread in the same LWP take the lock.
(On a UP box, it'll exit here before doing the while loop, as it
knows that spinning there is pointless.  I think you were testing
on an MP box.)  If the while loop concludes without anyone
acquiring the lock, it hands the lock off to the kernel, as no one
in this LWP cares about it (and we hit the semantics described
above).
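Here's that pseudocode as a toy model in C - the flags and return
strings are invented stand-ins for the state the real library
inspects, not Solaris symbols:

```c
#include <stdbool.h>

/* Toy model of the unlock path described above. */
bool local_waiter;        /* another thread on this LWP is queued */
bool taken_during_spin;   /* someone grabbed the lock while we spun */
bool uniprocessor;        /* on UP, spinning is pointless */

const char *unlock_adaptive_sketch(void)
{
    if (local_waiter)
        return "handed off within LWP";   /* clear lock bit locally */

    if (!uniprocessor) {
        int i;
        for (i = 0; i < 500; i++)         /* the ~500-iteration spin */
            if (taken_during_spin)
                return "taken during spin";
    }

    return "kernel wakeup";               /* no local taker: kernel */
}
```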

I'm wondering if this while loop (essentially a spin lock) may be
the root cause of the _lwp_mutex_unlock utilization you are seeing.

Anyway, in Solaris 9, IIRC, they have removed the user-space
scheduler and all threads are now bound (each thread maps directly
to a kernel thread).  You may achieve the same result in earlier
versions of Solaris by linking against the alternate thread
library in /usr/lib/lwp/ - which is also selectable by changing
your run-time link order.

You will definitely see different performance characteristics on
Solaris with the two threading models.  I'd encourage you to test
with bound threads.  =-)

For more Solaris-specific information, I would recommend both
Solaris Internals by Jim Mauro and Richard McDougall (both are
Sun Senior Engineers in the Performance Applications Engineering
group - aka kernel freaks) and the Solaris source code.  I'm
looking at:


and Chapter 3 in Solaris Internals.

If anyone is interested, I can email them mutex.c, as it is a
very interesting read and goes into their decisions as to why
they wake all waiters with mutex_exit.  (It's big, and I'm not
sure what the copyright restrictions are exactly...)

HTH.  -- justin
