httpd-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Justin Erenkrantz <>
Subject Re: weird problem on Solaris 2.6 - signals / shell ??
Date Wed, 30 May 2001 07:15:21 GMT
Hi gang,

A few weeks ago, Greg Ames posted a problem that he saw at IBM
regarding SIGTERM, piped logs, and Solaris (he said 2.6, but turns out
to be seen on Solaris 2.7 here at eBuilt).  

For Greg's original message, see:

The problem was initially reported against IBM Web Server, but I've now 
been able to reproduce this with Apache 1.3.21dev (current).

And, I think I've found the bug.  At least, I think I have.  This one
isn't easy to explain.  So, *please* bear with me.  If you have any
questions, email me and I'll try to explain further.  I'm not exactly
sure what is going on or what the proper long-term solution is.  The
short-term solution is to avoid using piped ErrorLog on Solaris with
more than a few piped TransferLogs.  There just seems to be some 
horrendously odd clashes between the two options causing some race
conditions.  Either one by themselves are fine, but together causes 
problems in the shutdown sequence leading to the parent httpd
not exiting properly (and spawning a few new rotatelogs children).

The problem appears to be with the initialization of the piped log 
for the ErrorLog directive *and* the shutdown process in
reclaim_child_processes.  When an ErrorLog is piped, it is opened by
error_log_child.  Notice that this is different than the
piped_log_spawn - my gut tells me if that we make error_log_child
similar to piped_log_spawn, this entire problem might go away.  But, 
I'm not familiar enough with the other_child code in 1.3 to tell for
certain.  This also looks like fairly old code (rse in '96 and
dgaudet in '97 based on cvs annotate).

The symptoms that I see on Solaris are thus:

1) The rotatelogs process for ErrorLog directive loses its parent
   after startup.  Hence, it's ppid is init.  I'm not sure how this
   plays in, but this doesn't look right.  I also can't recreate 2
   without having ErrorLog be a piped log.

2) On shutdown (via SIGTERM), a race condition occurs.  
   reclaim_child_processes kills all of the httpd children (fine),
   but the second half of the r_c_p call is a bit odd.  Basically,
   it calls piped_log_maint with OC_REASON_DEATH - this triggers
   piped_log_spawn to start up a new child since pl->program isn't 
   NULL.  This is completely and utterly wrong.  We're supposed to
   be shutting down, not starting up.  These new children never
   receive the SIGTERM - they will stick around until another
   SIGTERM occurs.  The old children have already quit due to the
   SIGTERM, but the piped_log_spawn starts up new rotatelogs 

Take away the ErrorLog directive - no visible problems.  Don't have 
any TransferLog directives?  No visible problem.  I think that the 
ErrorLog code is broken and we're hitting some sort of signal race 
condition in reclaim_child_processes for the TransferLog.  How are 
these interacting?  I haven't a foggy clue.


ErrorLog "| /foo/apache-1.3/bin/rotatelogs /foo/apache-1.3/logs/piped_error_log 86400"

causes the problem on Solaris.  You'll probably need a few TransferLog
directives to cause the inadvertant fork().  When you send SIGTERM to
the parent httpd process, new rotatelogs are created and hence the
parent will never quit (these new children didn't receive the SIGTERM).

I've attached the following things to maybe help others understand:

1) Truss output of parent httpd process receiving SIGTERM.
   Notice the fork within the waitpid calls.  I later killed off
   the rotatelogs and then sent SIGTERM again (it terminated).
2) gdb stack trace of reclaim_child_processes
3) ps listing showing that there is a rotatelogs with 1 as its

Am I right in my analysis?  Does this make any sense?  What is the 
proper fix to this?  Can anyone reproduce this on non-Solaris
machines?  -- justin

View raw message