httpd-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stefan Fritsch">
Subject Re: PR42829: graceful restart with multiple listeners using prefork MPM can result in hung processes
Date Fri, 01 Feb 2008 09:41:39 GMT
Joe Orton wrote:
> I mentioned in the bug that the signal handler could cause undefined
> behaviour, but I'm not sure now whether that is true.  On Linux I can
> reproduce some cases where this will happen, which are all due to
> well-defined behaviour:
> 1) with some (default on Linux) accept mutex types,
> apr_proc_mutex_lock() will loop on EINTR.  Hence, children blocked
> waiting for the mutex do "hang" until the mutex is released.  Fixing
> this would need some APR work, new interfaces, blah

This is not a problem. On graceful-stop or reload the processes will get
the lock one by one and die (or hang somewhere else). I have never seen a
left over process hanging in this function.

> 2) prefork's apr_pollset_poll() loop-on-EINTR loop was not checking
> die_now; the child holding the mutex will not die immediately if poll
> fails with EINTR, and will hence appear to "hang" until a new connection
> is recevied.  Fixed by

IMHO this is the same as 3), as apr_pollset_poll() will be called again
but with all fds already closed.

> I can also reproduce a third case, but I'm not sure about the cause:
> 3) apr_pollset_poll() is blocking despite the fact that the listening
> fds are supposedly already closed before entering the syscall.

This is the main problem in my experience.

> I vaguely recall some issue with epoll being mentioned before in the
> context of graceful stop, but I can't find a reference.  Colm?
> A very tempting explanation for (3) would be the fact that prefork only
> polls for POLLIN events, not POLLHUP or POLLERR, or indeed that it does
> not check that the returned event really is a POLLIN event; POSIX says
> on poll:
> " ... poll() shall set the POLLHUP, POLLERR, and POLLNVAL flag in
>  revents if the condition is true, even if the application did not set
>  the corresponding bit in events."

I also had problems under solaris 9 where processes blocked in 
lr->accept_func() if the fd had been closed in the meantime. 
Unfortunately, I cannot reproduce it now even with an unpatched 2.2.6 and
I don't remember which configuration I used. But this could be related to
the returned event not being POLLIN.

> and there's even a comment in the prefork poll code to the effect that
> maybe checking the returned event type would be a good idea.  But from a
> brief play around here, fixing the poll code to DTRT doesn't help.  I
> think more investigation is needed to understand exactly what is going
> on here.
> (Also, just to note; I can reproduce (3) even with my patch to dup2
> against the listener fds.)

On Linux with epoll, the hanging processes just blocks in
apr_pollset_poll(), so checking the return value won't do any good.

Maybe the problem is that (AIUI) poll() returns POLLNVAL if a fd is not
open, while epoll() does not have something similar. In epoll.c, a comment
says "APR_POLLNVAL is not handled by epoll". Or should epoll return
EPOLLHUP in this case?


View raw message