Date: Sun, 16 Sep 2001 19:59:19 -0700
From: Aaron Bannert
To: dev@apr.apache.org
Subject: Re: [proposal] apr_thread_setconcurrency()
Message-ID: <20010916195919.P11014@clove.org>
References: <20010914154448.V11014@clove.org> <20010914154959.I12417@ebuilt.com>
 <20010914162151.Z11014@clove.org> <20010914183347.K12417@ebuilt.com>
 <20010915164339.H11014@clove.org> <20010916005510.L12417@ebuilt.com>
In-Reply-To: <20010916005510.L12417@ebuilt.com>; from jerenkrantz@ebuilt.com on Sun, Sep 16, 2001 at 12:55:10AM -0700

On Sun, Sep 16, 2001 at 12:55:10AM -0700, Justin Erenkrantz wrote:
> I'm saying that it should never be used. Simple. You can't use
> that call properly in any real-world case - just like I don't think
> you should call sched_yield ever. You are attempting to solve a
> problem that is best solved somewhere else - the base operating
> system.

I aim to prove that there are cases where it is useful. I do not think
that sched_yield should be used, but that's a whole different story
(though I do think we should have a thread_yield for the sake of NetWare
and other totally userspace thread implementations -- not to stir up the
fire any more ;)

> The testlock case doesn't matter because it never hits any of the
> Solaris-defined entry points. This is a quirk in the OS and I see
> no reason to work around it. If you want to make testlock do the
> right thing with the Solaris LWP model, use a reader/writer lock
> to synchronize the starting of the threads. This way you guarantee
> that all threads are started before you start execution of the
> tight exclusive loop (which is something that testlock doesn't do
> now). You are assuming that the threads are created in parallel -
> nowhere is that ordering guaranteed.

I don't think it's a quirk of the thread library; I think it's fully
expected behavior. For the sake of others, here's an excerpt from the
Solaris 8 pthread_setconcurrency(3THR) man page:

  DESCRIPTION
       Unbound threads in a process may or may not be required to be
       simultaneously active. By default, the threads implementation
       ensures that a sufficient number of threads are active so that
       the process can continue to make progress. While this conserves
       system resources, it may not produce the most effective level
       of concurrency.

       The pthread_setconcurrency() function allows an application to
       inform the threads implementation of its desired concurrency
       level, new_level. The actual level of concurrency provided by
       the implementation as a result of this function call is
       unspecified.
  ...

Although that is a very vague description of the mechanics of this
call, it does make it clear that the initial settings may not be
desired in all cases.
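
For anyone following along, this is roughly the call pattern the man
page is describing -- hint the concurrency level once, before spawning
the pool of unbound threads. This is plain pthreads rather than the
proposed APR wrapper, and the names (worker, start_workers, nthreads)
are only illustrative:

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    static void *worker(void *arg)
    {
        /* per-thread work would go here */
        return NULL;
    }

    static int start_workers(int nthreads)
    {
        int i, rv;
        pthread_t tid;

        /* Tell the threads library we would like roughly nthreads
         * execution contexts (LWPs on Solaris).  This is only a hint;
         * the implementation may ignore it, so failure is not fatal. */
        rv = pthread_setconcurrency(nthreads);
        if (rv != 0)
            fprintf(stderr, "pthread_setconcurrency: %s\n", strerror(rv));

        for (i = 0; i < nthreads; i++) {
            if (pthread_create(&tid, NULL, worker, NULL) != 0)
                return -1;
            pthread_detach(tid);
        }
        return 0;
    }
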
> > In consideration of your statement here I spent some time reading
> > the Solaris 8 libpthread source. On that platform your statement
> > here is false. Calling pthread_setconcurrency (or thr_setconcurrency
> > for that matter) can only change the number of multiplexed LWPs in
> > two ways: either not at all, or by increasing the number. I see
> > no way that it acts as a ceiling.
>
> Yes, you are correct and I was wrong - I reread the Solaris Internals
> book on my flight back to LAX today. It isn't a ceiling. However,
> the case of creating too many LWPs is completely valid and is brought
> up many times in their discussion of LWPs versus a bound thread model.
> Kernel threads are very expensive in Solaris and part of the reason
> that it handles threads well is because it multiplexes the kernel
> threads efficiently. No other OS I have seen handles threads as
> gracefully as Solaris.

Creating too many LWPs may be a problem, and it is something I intend
to look into. I do, however, feel this is something the application
writer is going to have to deal with case by case.

In my experiments with setconcurrency I have arrived at some
conclusions (on Solaris 8):

 - setconcurrency(0) has no effect on the number of LWPs.
 - setconcurrency(n) will create new LWPs if n > current_num_lwps;
   otherwise it will have no effect on the number of LWPs.
 - If you set it too high, you will suffer a performance penalty.
 - If you set it too low, you will either not take advantage of other
   CPUs, or you will not see the load migrate to other CPUs until the
   "LWP creation agent" decides it's time to do so.

> I believe SUSv2 called it a "hint" for the general case. However, in
> this specific implementation (multiplexed kernel threads), it is not
> a hint. It is a request to have that many LWPs. If you disagree
> with that statement, please look at the code again.

I was very clear in my previous message, and I have restated it above.
I was refuting the comment you made saying it was a "command" and not
a "hint". It is indeed a hint: only in the case where you ask for more
LWPs than are currently allocated will it *attempt* to create more. In
_all other cases_ it will simply ignore the number you give it. It is
not a ceiling.

> I pointed out that number (simultaneous requests) is a completely
> bogus number to use when dealing with multiplexed kernel threads.
> This poor choice is why I don't think this call belongs in APR at all.
> If you would care to claim that the number of simultaneous requests is
> the correct number in the context of a multiplexed thread model for
> worker, I would be delighted to hear why - you haven't offered any
> proof as to its validity. I indicated why I thought that number was
> wrong. I'll repeat it again with a bit more of a technical
> explanation.

As I said at the beginning of this thread, I'd like to use this call in
more places than the worker MPM. I am not sure whether this will provide
a benefit to the worker MPM, but if it does then that is a good starting
place.

> Creating all user threads as bound (what you are suggesting for
> worker by calling pthread_setconcurrency with that value) in a
> multiplexed thread model works against the thread model rather than
> with it - this indicates a clash in design. You want a bound thread
> library, but refuse to use a bound thread library.

It's actually worse than creating them as bound. In most cases a bound
thread has an early exit point to the system call in the userspace
implementation. Having a pool of LWPs available to a group of userspace
threads means that they have to be assigned. Bound means you get one
LWP forever.
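
To make the terminology concrete: a "bound" thread is one created with
system contention scope, which ties the user thread to its own LWP for
its entire lifetime. The sketch below is generic pthreads (not APR or
httpd code), and the function names and error handling are only
illustrative:

    #include <pthread.h>

    static void *worker(void *arg)
    {
        /* per-thread work would go here */
        return NULL;
    }

    int create_bound_thread(pthread_t *tid)
    {
        pthread_attr_t attr;
        int rv;

        pthread_attr_init(&attr);
        /* PTHREAD_SCOPE_SYSTEM == a bound thread: it gets its own LWP
         * for its whole lifetime and is scheduled by the kernel. */
        rv = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
        if (rv == 0)
            rv = pthread_create(tid, &attr, worker, NULL);
        pthread_attr_destroy(&attr);
        return rv;
    }

    int create_unbound_thread(pthread_t *tid)
    {
        /* The Solaris default is PTHREAD_SCOPE_PROCESS: an unbound
         * thread that the userspace library multiplexes onto the
         * pool of LWPs. */
        return pthread_create(tid, NULL, worker, NULL);
    }
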
> Ideally, most of worker MPM's time will be spent dealing with I/O, so
> there is no need to have spurious kernel threads when in such a usage
> pattern. Solaris has a number of safeguards that will ensure that any
> runnable thread (kernel or user) will run as quickly as it can and it
> will only create as many kernel threads as are actually dictated by
> the load (if there are really 8 threads ready to run, 8 execution
> contexts will be available).
>
> With "scheduler activations" (Solaris 2.6+), when a user thread is
> about to block and other user threads are waiting to execute, the
> running LWP will pass that unbound (but now blocked) thread off to
> an idle LWP (via doors). If no free LWPs are available (all LWPs
> are blocked or executing), a new LWP is spawned (via SIGWAITING)
> and the now-blocked unbound user thread is transferred.
>
> This blocked user thread will resume via what Solaris calls "user
> thread activation" - shared memory and a door call which indicates to
> the kernel thread when a user thread is ready for execution (i.e.
> needs the LWP active now because whatever blocked it has now been
> unblocked). So as soon as the message is sent, the kernel will
> reschedule the appropriate LWP.
>
> Okay, back to the original LWP that the user thread was on - it has
> time left on its original quantum because its user thread was about
> to end prematurely, it then searches for a waiting unbound thread to
> execute in the remainder of its time.
>
> In the common case of a user thread blocking with a free LWP already
> created, you have saved a kernel context switch (the running LWP
> sticks the user thread in an idle LWP by itself) - this is why this
> M*N implementation can be faster than bound threads. The context
> switch is free and the responsiveness is thus higher. This also
> causes it to create kernel threads as needed.
>
> The entire idea of a multiplexed kernel thread model (such as
> Solaris) is to minimize the number of actual kernel threads and
> increase responsiveness. You would be circumventing that
> decision by creating bound kernel threads that may not be
> actually required due to the actual execution pattern of the code.
> You will also decrease responsiveness because switching between
> threads now becomes a kernel issue rather than a cheap user-space
> issue (which is what Solaris wants to do by default). However,
> you do this in a library that was optimized for multiple
> user-space threads, not bound threads.
>
> I believe if you really want a bound thread implementation, you should
> tell the OS you want it - not muck around with an indeterminate API to
> do so that directly circumvents the scheduling/balancing process.

I don't want a bound thread impl, or I would have done that with the
thread attribute at creation time. I want the threads to ramp up fast
and I want them to migrate to other CPUs quickly.
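
Since the question keeps coming down to what the proposed call would
actually do, here is the rough shape I have in mind for the wrapper.
This is only a sketch, not a patch: the final name and signature are
still open, and it assumes a configure check that defines
HAVE_PTHREAD_SETCONCURRENCY on platforms that have the call.

    #include "apr.h"
    #include "apr_errno.h"
    #if APR_HAVE_PTHREAD_H
    #include <pthread.h>
    #endif

    apr_status_t apr_thread_setconcurrency(apr_int32_t new_level)
    {
    #if APR_HAS_THREADS && defined(HAVE_PTHREAD_SETCONCURRENCY)
        /* Pass the hint straight through to the threads library.
         * pthread_setconcurrency() returns 0 on success or an errno
         * value, which is handed back as the apr_status_t. */
        int rv = pthread_setconcurrency((int)new_level);
        return (rv == 0) ? APR_SUCCESS : rv;
    #else
        /* Nothing useful to do on this platform; the hint is ignored. */
        return APR_ENOTIMPL;
    #endif
    }

The point is that it is never more than a hint: on a platform with a
multiplexed thread model it can only ask for more execution contexts,
and everywhere else it degrades to a no-op.
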
> > There you go again with this "OS scheduler" thing that I've never
> > heard of. 10 seconds to stabilize is rather long when you consider
> > I have already served O(5000) requests.
>
> You are really attempting to make this a personal argument here by
> attacking me. I think this is completely uncalled for and
> inappropriate.

I apologise for the more snide comments made in my previous message.
They were perhaps inappropriate for this forum. I do, however, expect
this discussion to narrow in on the facts and come to a rational
conclusion instead of lingering on vague, undefined concepts.

> 10 seconds isn't a long time for a server that will be up for months
> or years. And, as you said, you pulled that number (10 seconds) out
> of thin air. If you can substantiate it with real results, please
> provide them. I don't consider a case of a 10 second delay for the
> OS to properly balance itself with a particular thread model an issue.
> And, what is the impact of not having enough LWPs initially? Were
> you testing on an SMP or UP box? What was the type of CPU load that
> was being performed before it was balanced (usr, sys, or iowait)?

Unfortunately, it may not be true that the server will be up for months
or years. In the best of cases we can hope for MaxRequestsPerChild to
be infinite, but the reality is that 3rd-party modules (and even httpd
itself) may leak memory. IIRC, the default MaxRequestsPerChild is 10000.
If it takes 5000 requests to reach a steady state, each child spends
half of its lifetime ramping up before it has to start all over again.

> You also haven't mentioned how many LWPs it stabilized at after
> 10 seconds? Did Solaris choose to add a LWP for each user thread?
> I have a feeling it wouldn't, but I may be wrong. -- justin

I'll follow up this reply with some real numbers.

-aaron