Mailing-List: contact dev-help@httpd.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@httpd.apache.org
Message-ID: <ED8D446688F0EF45A7BCA928FB86F98277DDB3@MERCURY.netegrity.com>
From: "Arliss, Noah" <narliss@netegrity.com>
To: dev@httpd.apache.org
Cc: "Mary, Dave" <dmary@netegrity.com>
Subject: RE: [PATCH] fix child reclaim timing
Date: Fri, 13 Aug 2004 16:48:42 -0400
MIME-Version: 1.0
Content-Type: text/plain

I'd like to comment further... Not only is a disturbing message sent to the
error log, but a SIGTERM is also sent to the child process. If I understand
correctly, the SIGTERM will likely interrupt any properly implemented child
process shutdown, and the child process will exit ungracefully. If it's
acceptable to wait longer, then the kill call should also be postponed to
give modules a chance to clean up gracefully. If any module has complex IPC
or mutexes in use, graceful shutdown is important, especially if
MaxRequestsPerChild is in use on a server with heavy load.

-Noah

-----Original Message-----
From: Jeff Trawick [mailto:trawick@gmail.com] 
Sent: Friday, August 13, 2004 10:27 AM
To: dev@httpd.apache.org
Subject: Re: [PATCH] fix child reclaim timing

On Fri, 13 Aug 2004 14:51:23 +0100, Joe Orton <jorton@redhat.com> wrote:
> The 2.0 ap_reclaim_child_processes logic seems to be broken - it never
> resets the waittime variable as it did in 1.3; so the parent will wait
> for up to 23 minutes (sic) in total for a stuck child process.  (SIGSTOP
> a child and strace the parent to see for yourself)
> 
> This updates the logic to be a little more sane:
> 
> - at t + 16, 82, 344 ms, just waitpid()
> - at t + 425, 688, 1736 ms, waitpid() else SIGTERM the child
> - at t + 1.74 secs, waitpid() else SIGKILL the child
> - at t + 1.75, 1.82 secs, just waitpid()
> - at t + 2.08 secs, waitpid() else log "this child won't die"
> 
> Any comments?

Here is my take on what is wrong with the current code:

1) It starts complaining a bit too soon.  Some third-party modules
have rather complicated child exit strategies.  Whether that is good
or bad (bad ;) ), it results in disturbing messages that wouldn't
have appeared if we were a little more patient (2-3 seconds).  Also, I
suspect that the use of a threaded MPM affects how quickly the children
are exiting now on Unix.

2) It should never check for exited processes less often than every
1-2 seconds, even if it doesn't complain to the error log that often.
Like you say, the current code can wait a VERY long time for child
processes to exit.  In practice, I see that it can wait a VERY long
time even after the last child has exited.

I'll agree that it should never wait so long, though I think around 15
or so seconds total is reasonable.  Exiting before children are gone
doesn't let Apache start up any more quickly; it just prevents
potentially-useful information about timing from getting logged to the
error log.

--/--

I wouldn't complain to the error log at all until it has been 2 seconds,
and then I'd still wait around for 10-15 more.  But it has to check
every second so it finds out soon after all children have exited and
doesn't sleep needlessly.
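
To make the timing concrete, here is a rough sketch of that policy in
plain POSIX C.  It is not the httpd code and not a patch: the names
(next_step, reap_exited, reclaim_children) and the exact thresholds
(poll every 1 second, first complaint at 2 seconds, give up at 15) are
assumptions picked from the ranges above, and deciding when to send
SIGTERM or SIGKILL to a stuck child is deliberately left out.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumed thresholds, picked from the ranges suggested above. */
#define POLL_SECS      1   /* check for exited children every second */
#define COMPLAIN_SECS  2   /* nothing goes to the error log before this */
#define GIVE_UP_SECS  15   /* total time to wait for children */

enum reclaim_step {
    RECLAIM_DONE,      /* no children left, stop immediately */
    RECLAIM_WAIT,      /* children left, keep polling quietly */
    RECLAIM_COMPLAIN,  /* children left, log once and keep polling */
    RECLAIM_GIVE_UP    /* children left, but we have waited long enough */
};

/* Pure policy: what to do when `left` children remain after `elapsed` secs. */
static enum reclaim_step next_step(int elapsed, int left)
{
    assert(elapsed >= 0 && left >= 0);
    if (left == 0)
        return RECLAIM_DONE;
    if (elapsed >= GIVE_UP_SECS)
        return RECLAIM_GIVE_UP;
    if (elapsed == COMPLAIN_SECS)
        return RECLAIM_COMPLAIN;
    return RECLAIM_WAIT;
}

/* Non-blocking reap: compacts pids[] to the children still running and
 * returns how many are left.  A waitpid() error (e.g. ECHILD) is treated
 * as "gone". */
static int reap_exited(pid_t *pids, int n)
{
    int left = 0;
    for (int i = 0; i < n; i++) {
        if (waitpid(pids[i], NULL, WNOHANG) == 0)
            pids[left++] = pids[i];
    }
    return left;
}

/* Polls once per second; returns the number of children that never exited.
 * `elapsed` is approximate, since sleep() can return early on a signal. */
static int reclaim_children(pid_t *pids, int n)
{
    for (int elapsed = 0; ; elapsed += POLL_SECS) {
        n = reap_exited(pids, n);
        switch (next_step(elapsed, n)) {
        case RECLAIM_DONE:
        case RECLAIM_GIVE_UP:
            return n;
        case RECLAIM_COMPLAIN:
            fprintf(stderr,
                    "%d child process(es) still running after %d seconds\n",
                    n, elapsed);
            break;
        case RECLAIM_WAIT:
            break;
        }
        sleep(POLL_SECS);
    }
}
```

Because the reap happens before every sleep, the loop returns within
about a second of the last child exiting, and it stays silent unless
something is still running at the 2 second mark.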