Mailing-List: contact new-httpd-help@apache.org; run by ezmlm
Precedence: bulk
Reply-To: new-httpd@apache.org
Sender: trawick@rdu88-250-179.nc.rr.com
To: new-httpd@apache.org
Subject: some restart problems with prefork
From: Jeff Trawick <trawick@attglobal.net>
Date: 09 Jul 2001 16:06:48 -0400
Message-ID: <m37kxhlvdz.fsf@rdu88-250-179.nc.rr.com>
Lines: 55
User-Agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/20.3
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Greg Ames played with this a bit, then we discussed...

symptom:

parent process hangs during graceful restart

parent process is hung in connect()
we've already done a bunch of connects but have hundreds left to do

the kernel (FreeBSD, at least) will accept only a certain number of
queued connections before it blocks connect() (some details are
specific to local connections)

one problem:

we try to do ap_daemons_limit connects...  this could be hundreds more
than we need, certainly enough to cause connect() to block

another problem:

even if we did the "right number" of connects some processes could be
busy for quite a while processing old requests; we don't want to hold
up the parent waiting for them to accept

proposed solution:

step 1:
set an APR timeout on the socket used for connect; if connect fails
due to timeout* then stop connecting and let the parent process go
forward with the restart

I don't yet know whether or not we need to write more chars to the pod
even when connect() hangs.  Hopefully not ('cause the pod would get
cleaned up in the server and a read in the child will fail and thus
signal the child to go away) but we'll see...  We definitely don't
want to hang more than once.

We'll try a timeout of several seconds and see how it works.  We don't
want to hold up the parent process long, but then on a sick system it
may take a while to wake everybody up.  If we stop connecting then we
rely on real requests to wake up servers from the old generation.
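
A rough sketch of step 1, written against plain POSIX sockets rather
than APR (the APR timeout path is the part that still needs fixing).
connect_with_timeout() and wake_children() are made-up names for
illustration, not existing httpd or APR functions, and the return
codes are assumptions too.  The idea is just: non-blocking connect,
wait with poll() for at most timeout_ms, and stop the loop the first
time a connect does not complete.

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical result codes for one wake-up attempt. */
enum { WAKE_OK = 0, WAKE_TIMEOUT = 1, WAKE_ERROR = -1 };

/* Hypothetical helper: connect to sa, giving up after timeout_ms.
 * The socket is closed again right away; the connection only exists
 * to kick a child out of accept(). */
static int connect_with_timeout(const struct sockaddr *sa, socklen_t salen,
                                int timeout_ms)
{
    int fd = socket(sa->sa_family, SOCK_STREAM, 0);
    if (fd < 0)
        return WAKE_ERROR;

    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return WAKE_ERROR;
    }

    int rc = WAKE_OK;
    if (connect(fd, sa, salen) < 0) {
        if (errno != EINPROGRESS) {
            rc = WAKE_ERROR;              /* refused, queue full, etc. */
        } else {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int n = poll(&pfd, 1, timeout_ms);
            if (n == 0) {
                rc = WAKE_TIMEOUT;        /* nobody picked it up in time */
            } else if (n < 0) {
                rc = WAKE_ERROR;          /* includes EINTR, for brevity */
            } else {
                int err = 0;
                socklen_t elen = sizeof err;
                if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen) < 0
                    || err != 0)
                    rc = WAKE_ERROR;
            }
        }
    }
    close(fd);
    return rc;
}

/* Try up to n wake-up connects; stop at the first one that does not
 * complete, so the parent hangs at most once.  Returns how many
 * connects succeeded. */
static int wake_children(const struct sockaddr *sa, socklen_t salen,
                         int n, int timeout_ms)
{
    int done = 0;
    for (int i = 0; i < n; i++) {
        if (connect_with_timeout(sa, salen, timeout_ms) != WAKE_OK)
            break;
        done++;
    }
    return done;
}
```

Stopping at the first failure is what keeps the parent from paying
the timeout more than once; whatever children are left would then be
woken by real requests.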

step 2:
figure out how many server processes are really active (it might be a
bit inaccurate due to server processes going away) and write to the
pod that many times
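
Step 2 could look roughly like the following.  This is only a sketch:
struct child_slot is a stand-in for whatever the scoreboard really
records (here a slot counts as active when its pid is non-zero, which
is an assumption), and count_active() / signal_pod() are hypothetical
names.  The pod is treated as an ordinary pipe write end.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-child record; pid == 0 means the slot is unused. */
struct child_slot {
    pid_t pid;
};

/* Count slots that currently hold a live child.  The answer can be
 * slightly stale if children exit while we are counting. */
static int count_active(const struct child_slot *slots, size_t n)
{
    int active = 0;
    for (size_t i = 0; i < n; i++) {
        if (slots[i].pid != 0)
            active++;
    }
    return active;
}

/* Write one byte to the pod per active child.  Returns the number of
 * bytes actually written; retries on EINTR and stops on any other
 * error.  Note that write() can block if the pipe buffer fills up. */
static int signal_pod(int pod_wfd, int count)
{
    const char c = '!';
    int written = 0;
    while (written < count) {
        ssize_t rv = write(pod_wfd, &c, 1);
        if (rv == 1)
            written++;
        else if (rv < 0 && errno == EINTR)
            continue;
        else
            break;
    }
    return written;
}
```

The parent would then call signal_pod(pod_fd, count_active(slots, n))
instead of writing ap_daemons_limit bytes.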

*apr_connect() for Unix needs to be fixed up to handle timeouts
properly

comments?
-- 
Jeff Trawick | trawick@attglobal.net | PGP public key at web site:
       http://www.geocities.com/SiliconValley/Park/9289/
             Born in Roswell... married an alien...