To: new-httpd@apache.org
Subject: some restart problems with prefork
From: Jeff Trawick
Date: 09 Jul 2001 16:06:48 -0400

Greg Ames played a bit then we discussed...

symptom: parent process hangs during graceful restart

  parent process is hung in connect()

  we've already done a bunch of connects but have hundreds left to do

  the kernel (FreeBSD, at least) will accept only a certain number of
  queued connections before it blocks connect() (some details are
  specific to local connection)

one problem: we try to do ap_daemons_limit connects... this could be
  hundreds more than we need, certainly enough more to cause connect
  to block

another problem: even if we did the "right number" of connects some
  processes could be busy for quite a while processing old requests;
  we don't want to hold up the parent waiting for them to accept

proposed solution:

step 1: set an APR timeout on the socket used for connect; if connect
  fails due to timeout* then stop connecting and let the parent
  process go forward with the restart

  I don't yet know whether or not we need to write more chars to the
  pod even when connect() hangs.  Hopefully not ('cause the pod would
  get cleaned up in the server and a read in the child will fail and
  thus signal the child to go away) but we'll see...
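
  A rough sketch of the step 1 idea, assuming plain POSIX sockets
  instead of APR (the APR timeout path is exactly what still needs
  fixing, see the footnote).  The names connect_with_timeout() and
  wake_children() are made up for illustration and are not httpd or
  APR functions.  It only shows the "give up at the first failed or
  timed-out connect" behavior, nothing about the pod:

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Close fd without clobbering the errno that describes the failure. */
static int close_keep_errno(int fd)
{
    int saved = errno;
    close(fd);
    errno = saved;
    return -1;
}

/* Hypothetical helper: connect() that gives up after timeout_ms.
 * Returns a connected fd, or -1 with errno set (ETIMEDOUT on timeout). */
static int connect_with_timeout(const struct sockaddr *sa, socklen_t salen,
                                int timeout_ms)
{
    int fd = socket(sa->sa_family, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Non-blocking mode so connect() returns instead of hanging. */
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return close_keep_errno(fd);

    if (connect(fd, sa, salen) == 0)
        return fd;                      /* connected immediately */
    if (errno != EINPROGRESS)
        return close_keep_errno(fd);    /* hard failure, e.g. refused */

    /* Wait until the socket is writable or the timeout expires. */
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc == 0)
        errno = ETIMEDOUT;
    if (rc <= 0)
        return close_keep_errno(fd);    /* timeout or poll() error */

    /* Writable does not mean connected: fetch the real result. */
    int err = 0;
    socklen_t errlen = sizeof err;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
        return close_keep_errno(fd);
    if (err != 0) {
        errno = err;
        return close_keep_errno(fd);
    }
    return fd;
}

/* Try up to `wanted` wake-up connects and stop at the first failure,
 * so the caller waits out at most one timeout.  Returns the number of
 * connects that succeeded. */
static int wake_children(const struct sockaddr *sa, socklen_t salen,
                         int wanted, int timeout_ms)
{
    int done = 0;
    while (done < wanted) {
        int fd = connect_with_timeout(sa, salen, timeout_ms);
        if (fd < 0)
            break;
        close(fd);
        done++;
    }
    return done;
}
```

  The parent would call something like wake_children(addr, len, n,
  timeout_ms) and carry on with the restart whatever count comes back.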
  We definitely don't want to hang more than once.  We'll try a
  timeout of several seconds and see how it works.  We don't want to
  hold up the parent process long, but then on a sick system it may
  take a while to wake everybody up.

  If we stop connecting then we rely on real requests to wake up
  servers from the old generation.

step 2: figure out how many server processes are really active (it
  might be a bit inaccurate due to server processes going away) and
  write to the pod that many times

*apr_connect() for Unix needs to be fixed up to handle timeouts
 properly

comments?

-- 
Jeff Trawick | trawick@attglobal.net | PGP public key at web site:
       http://www.geocities.com/SiliconValley/Park/9289/
             Born in Roswell... married an alien...