From: "Arliss, Noah"
To: dev@httpd.apache.org
Cc: "Mary, Dave"
Subject: RE: [PATCH] fix child reclaim timing
Date: Fri, 13 Aug 2004 16:48:42 -0400

I'd like to comment further...

Not only is a disturbing message sent to the error log, but a SIGTERM is also sent to the child process. If I understand correctly the SIGTERM will likely interrupt any properly implemented child process shutdown and the child process will exit ungracefully.
If it's acceptable to wait longer, then the kill call should also be postponed to give modules a chance to clean up gracefully. If any module has complex IPC or mutexes in use, graceful shutdown is important, especially if MaxRequestsPerChild is in use on a server with heavy load.

-Noah

-----Original Message-----
From: Jeff Trawick [mailto:trawick@gmail.com]
Sent: Friday, August 13, 2004 10:27 AM
To: dev@httpd.apache.org
Subject: Re: [PATCH] fix child reclaim timing

On Fri, 13 Aug 2004 14:51:23 +0100, Joe Orton wrote:
> The 2.0 ap_reclaim_child_processes logic seems to be broken - it never
> resets the waittime variable as it did in 1.3; so the parent will wait
> for up to 23 minutes (sic) in total for a stuck child process. (SIGSTOP
> a child and strace the parent to see for yourself)
>
> This updates the logic to be a little more sane:
>
> - at t + 16, 82, 344 ms, just waitpid()
> - at t + 425, 688, 1736 ms, waitpid() else SIGTERM the child
> - at t + 1.74 secs, waitpid() else SIGKILL the child
> - at t + 1.75, 1.82 secs, just waitpid()
> - at t + 2.08 secs, waitpid() else log "this child won't die"
>
> Any comments?

Here is my take on what is wrong with the current code:

1) It starts complaining a bit too soon. Some third-party modules have rather complicated child exit strategies. Whether or not that is good or bad (bad ;) ), it results in disturbing messages that wouldn't have appeared if we were a little more patient (2-3 seconds). Also, I suspect that the use of a threaded MPM affects how quickly the children are exiting now on Unix.

2) It should never check for exited processes less often than every 1-2 seconds, even if it doesn't complain to the error log that often.

Like you say, the current code can wait a VERY long time for child processes to exit. In practice, I see that it can wait a VERY long time even after the last child has exited. I'll agree that it should never wait so long, though I think around 15 or so seconds total is reasonable.
Exiting before children are gone doesn't let Apache start up any more quickly; it just prevents potentially useful information about timing from getting logged to the error log.

--/--

I wouldn't complain to the error log at all until it has been 2 seconds, and then I'd still wait around for 10-15 more. But it has to check every second so that it finds out soon after all children have exited and doesn't sleep needlessly.
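
For illustration, the policy described above (check every second, stay quiet for the first 2 seconds, stop waiting after roughly 15) might be sketched in C as follows. This is only a sketch under stated assumptions, not the ap_reclaim_child_processes code: the names reclaim_policy, reclaim_one_child, QUIET_SECS and GIVE_UP_SECS are hypothetical, and sending SIGTERM at the first complaint and SIGKILL when giving up is an assumption that combines points raised in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical thresholds based on the numbers discussed in this thread;
 * they are not values taken from the httpd source. */
#define QUIET_SECS   2   /* no complaint (and no SIGTERM) before this */
#define GIVE_UP_SECS 15  /* total time the parent is willing to wait */

typedef enum {
    RECLAIM_DONE,      /* every child has been reaped */
    RECLAIM_WAIT,      /* keep polling quietly */
    RECLAIM_COMPLAIN,  /* keep polling, but log/signal stragglers */
    RECLAIM_GIVE_UP    /* stop waiting */
} reclaim_action;

/* Decide what the parent should do after elapsed_secs seconds with
 * children_left children still running. Pure function, no side effects. */
reclaim_action reclaim_policy(int elapsed_secs, int children_left)
{
    if (children_left <= 0)
        return RECLAIM_DONE;
    if (elapsed_secs >= GIVE_UP_SECS)
        return RECLAIM_GIVE_UP;
    if (elapsed_secs >= QUIET_SECS)
        return RECLAIM_COMPLAIN;
    return RECLAIM_WAIT;
}

/* Poll a single child once per second with a non-blocking waitpid().
 * Returns the number of seconds waited, or -1 if the child had to be
 * killed after GIVE_UP_SECS. */
int reclaim_one_child(pid_t pid)
{
    int elapsed;
    int sent_term = 0;
    int status;

    assert(pid > 0);
    for (elapsed = 0; ; elapsed++) {
        /* waitpid() returns 0 while the child is still running; any other
         * result (reaped, or an error such as ECHILD) means nothing is
         * left to wait for. */
        int left = (waitpid(pid, &status, WNOHANG) == 0) ? 1 : 0;

        switch (reclaim_policy(elapsed, left)) {
        case RECLAIM_DONE:
            return elapsed;
        case RECLAIM_COMPLAIN:
            if (!sent_term) {          /* assumption: SIGTERM only once */
                kill(pid, SIGTERM);
                sent_term = 1;
            }
            break;
        case RECLAIM_GIVE_UP:
            kill(pid, SIGKILL);        /* assumption: escalate, then reap */
            waitpid(pid, &status, 0);
            return -1;
        case RECLAIM_WAIT:
            break;
        }
        sleep(1);                      /* check again in one second */
    }
}
```

A parent would call reclaim_one_child(pid) after asking a child to exit. Keeping the decision in reclaim_policy, separate from the waitpid() loop, means the timing thresholds can be checked without forking anything; with several children, the same policy could be applied to a count of children not yet reaped.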