httpd-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Brian Candler <>
Subject [PATCH] Race condition with CGI reaping under Solaris
Date Tue, 18 Mar 2003 17:24:23 GMT

For a while I have been noticing performance problems with CGI scripts
running under Apache (1.3.27) and Solaris (2.8). These are "normal" forked
CGI scripts: not FastCGI, mod_perl etc.

The symptom is that very often there is a three-second delay between
requesting the page and getting the completed response. It's actually
reasonably easy to replicate. Take a simple CGI script like this:

echo "Content-Type: text/html"
echo ""
echo "Hello from shell"

Hit the webserver with 50 requests straight after each other for this page,
and I find that almost every alternate request takes 3 seconds. However,
with this example, trussing the process which answers the requests is enough
to make the problem go away (hinting at a race condition).

Running the equivalent script in Ruby did allow me to get a truss on it and
still fail occasionally, which showed me:

12458:  24.7093 waitid(P_PID, 12472, 0xFFBEF7E8, WEXITED|WTRAPPED|WNOHANG) = 0
12458:  24.7096 kill(12472, SIGTERM)                            = 0
12458:  24.7104 alarm(3)                                        = 0
12458:  sigsuspend(0xFFBEF7F8)          (sleeping...)
12458:  27.7048     Received signal #14, SIGALRM, in sigsuspend() [caught]
12458:  27.7051 sigsuspend(0xFFBEF7F8)                          Err#4 EINTR

It then turns out that you can replicate the program 100% reliably by
introducing a small delay after closing stdin/stdout but before exiting:

puts "Content-Type: text/html"
puts ""
puts "Hello from Ruby"
sleep 0.2

>From looking through the code, I believe the problem lies in free_proc_chain
in src/main/alloc.c. The logic is:

- call waitpid() for each process to be cleaned up, to see if it has
  already died
- if it hasn't, then send it a kill SIGTERM
- if kill returns 0 (meaning the process is still alive) then sleep 3 seconds
  then send it a SIGKILL

The race condition seems to as follows: when the CGI process has finished
its work, it closes stdin/stdout which completes the request. However it
takes a short period of time to tidy up after itself before it finally dies,
by which time Apache has already decided to send it a SIGTERM and is
committed to a 3 second wait.

This is the first time I have dug around the internals of Apache and I am
not certain as to what is _supposed_ to happen in this case; in particular I
don't know if the sleep(3) is meant to be interruptible by a SIGCHLD.
However if it is, it certainly isn't happening on my platform.

My proposed solution is to poll waitpid() a few times before deciding that
the process needs to be sent a TERM. I have put together a rough patch as
proof-of-concept (attached) which does this at intervals of 10ms, 20ms,
40ms, 80ms, giving a total potential wait of 150ms. Maybe it should have
another iteration to make 310ms. It's not particularly pretty but I wanted
to localise my changes as far as possible.

Given my lack of knowledge of the Apache codebase I can't say whether this
is The Right Way[TM] to deal with this problem, but it's certainly made a
huge improvement here. Comments gratefully received.

If this is the right approach, perhaps one of the main Apache developers
would like to produce a tidier patch?


Brian Candler.

View raw message