httpd-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From dean gaudet <dgaudet-list-new-ht...@arctic.org>
Subject Re: Problem found in perform_idle_server_maintenance on prefork.
Date Wed, 28 Feb 2001 06:38:30 GMT
yay asynchronous notification!  it sure sucks :)

On Tue, 27 Feb 2001, Paul J. Reder wrote:

>   1) Don't allow perform_idle_server_maintenance to clean up children in the Starting
state.
>         Kill the highest numbered child not in the S state.
>   2) Look into closing the timing window that allows this to happen. Continue allowing
state
>         S children to be killed.
>   3) Keep track of the last pid killed. If it is the same pid N times in a row, change
its
>         state to Zombie and kill the next pid down. Log the presence of a zombie for
debugging
>         purposes. Don't try to kill Zombies.
>   4) Something else more appropriate?

1 and 3 just hide the problems... you could have trouble any time SIGWINCH
arrives during the alloc_mutex critical sections.

a more accurate fix would be something like this:

- add apr_lock_try() which behaves like pthread_mutex_trylock() does
- use apr_lock_try() in the signal handler.
  if it succeeds, continue with cleanup
  if it fails, then set a global indicating deferred cleanup
    also block SIGWINCH before returning from the sighandler
- after every instance of apr_lock_release(alloc_mutex), test the global
  and if deferred cleanup is indicated then proceed with the deferred
  cleanup

but that alone is just the tip of the iceberg.

take a look at ap_popenf() in apache-1.3.  notice that it starts with
ap_block_alarms().

ok now go look at apr_file_open() in apache-2.0.  notice that it opens a
file and then registers the cleanup.  notice how if SIGWINCH arrives at
the wrong time, the cleanup won't have been registered.

big deal it's just a file... i pretty much agree on this resource.  but
i'm pointing it out more as a fundamental problem.  essentially anything
which doesn't have access to the alloc_mutex can't guarantee atomicity
with respect to SIGWINCH.  might as well not have cleanups at all, or just
not expect them to be called when the process is killed.

(alloc_mutex isn't the only way to solve it, you can do just what 1.3's
ap_block_alarms do.)

i'm pretty sure SIGWINCH is just the SIGUSR1 replacement right?  it tells
the child to die as soon as is convenient.  so the process really only
needs to die off after it's fully initialised.  if so then all of the
above is avoided by using signal handlers in the only way they're safe to
use:  to tweak a global variable.  (1.3's USR1 handler is almost that
simple.)

and actually that graceful die flag might as well live in the scoreboard.
i can't remember why the graceful die flag doesn't live in the scoreboard
in 1.3... maybe i just never thought of it.  hrm.  i bet i just didn't
bother to move it there because 1.3 code had to be signal aware anyhow.
you may want to move this to the scoreboard in 2.0 to eliminate the
SIGWINCH in the children, it'll mean fewer potential problems with
non-signal aware 3rd party code.

almost all of these types of race conditions in 1.2 were discovered by
doing "while 1; kill -HUP" or kill -USR1 at the same time the server had a
load running against it, and i was browsing pages.  (the same test helped
shake out ancient linux SMP kernel bugs too.)

-dean


Mime
View raw message