httpd-dev mailing list archives

From Greg Ames <>
Subject Re: accept mutex failure causes fork bomb
Date Tue, 15 Sep 2009 18:02:25 GMT
On Tue, Sep 15, 2009 at 8:07 AM, Jeff Trawick <> wrote:

> On Mon, Sep 14, 2009 at 4:27 PM, Greg Ames <> wrote:
>> I'm trying to debug a problem where apparently the accept mutex went bad
>> on a z/OS system running the worker MPM.  I'm guessing that some memory that
>> we use for the semaphore got clobbered but don't have proof yet.  The error
>> log looks like:
>> [Mon Sep 07 08:01:59 2009] [emerg] (121)EDC5121I Invalid argument.:
>> apr_proc_mutex_unlock failed. Attempting to shutdown process gracefully.
> Could it be some system limit exceeded, like max waiters on a mutex?  (IIRC
> some OSs require tuning of SysV semaphores; older Solaris comes to mind.)

it's possible.  but the very first error was for unlock, so the lock must
have worked.  I would think most of the SysV tuning stuff is more critical
for getting a lock than releasing it.

>> * Should we do clean_child_exit(APEXIT_CHILDSICK or CHILDFATAL) for this
>> error?  We have a previous fix to detect accept mutex failures during
>> restarts and tone down the error messages.  I don't recall seeing any false
>> error messages after that fix.  We could also use requests_this_child to
>> detect if this process has ever successfully served a request, and only do
>> the clean_child_exit if it hasn't.
> So nasty failures prior to successfully accepting a connection bypass
> squatting?  Good.

yep, that's the idea.  any of the exit() calls should bypass setting
ps->quiescing and disable squatting.

> CHILDSICK or CHILDFATAL in that case?  In this example it probably wasn't
> going to get any better.  However, I think it is reasonably likely that
> child process n+1 encounters some sort of resource limit, so CHILDSICK seems
> better.

I agree.  we have existing logic in the parent which will shut down the
server if it can't find any healthy worker threads.

>> * Should we yank the squatting logic?  I think it is doing us more harm
>> than good.  IIRC it was put in to make the server respond faster when the
>> workload is spiky.  A more robust solution may be to set Min and
>> MaxSpareThreads farther apart and allow ServerLimit to be enforced
>> unconditionally.  disclaimer: I created ps->quiescing, so I was an
>> accomplice.
> My understanding is that squatting is required to deal with long-running
> requests that keep a child trying to exit from actually exiting, thus tying
> up a scoreboard process indefinitely.

you are right.  thanks for the reminder.

> Is it reasonable to have up to MaxClients worth of squatting like we have
> now?  (I think that is what we allow.)  No, I don't think so.

I think it's worse than that.  it's more like (MaxClients - threads in fully
active processes) / (avg. number of threads that
perform_idle_server_maintenance thinks are healthy per quiescing process).
so if we have an average of one thread per quiescing process downloading a
DVD over dialup, you are exactly right.  if we have zero healthy threads in
quiescing processes because the worker threads have exited but the process
isn't completely gone, the denominator can get pretty small and there isn't
any limit on squatting.

> Should we axe squatting, respect ServerLimit, and thus make the admin raise
> ServerLimit

or make the admin increase the difference between MinSpareThreads and
MaxSpareThreads to reduce the oscillations between shutting down worker
processes and forking more.
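
Something like the following worker-MPM fragment, say: the specific numbers are examples only, not a recommendation.  Widening the MinSpareThreads/MaxSpareThreads band gives the parent more slack before it kills idle children or forks new ones, damping the oscillation described above.

```apache
# Illustrative worker-MPM settings (example values only)
<IfModule mpm_worker_module>
    ServerLimit          16
    ThreadsPerChild      25
    MaxClients          400
    MinSpareThreads      25
    MaxSpareThreads     250
</IfModule>
```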

> to accommodate exiting processes which are handling long-running requests
> (and waste a bit of shared memory at the same time)?  Maybe ;)  That change
> seems a bit drastic, but I'm probably just scared of another long period of
> time before I halfway understand how it behaves in the real world.

yeah, this is a sensitive and difficult-to-grok piece of code.  I hope
anything we do here simplifies it.  I don't want to have to write the doc
for how ServerLimit and squatting interact.

