apr-dev mailing list archives

From Paul Querna <c...@force-elite.com>
Subject Bug in epoll
Date Wed, 17 Jan 2007 03:41:57 GMT
Hey All,

I am observing a bug in apr_pollset_poll.

What I am seeing is this:
1) About 140 sockets to different machines are added to the pollset in
apr_memcache_multgetp(), watching for read availability.
2) Write memcache requests to the servers
3) Start _poll()'ing for data:
  a) The first couple sockets come back within a few milliseconds, and
are read correctly.
  b) The next time apr_pollset_poll is called, it does return, but only
a SINGLE socket is marked as available, and the call blocks until within
1 millisecond of the TIMEOUT value. This single socket is read correctly.
  c) The next time apr_pollset_poll is called, it behaves like normal,
     returning multiple results, in a very short time period.

The pattern of a,b,c sometimes repeats multiple times before all of the
data has been received from the servers.
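
For anyone who wants to look at this outside of APR, here is a minimal sketch of the kind of measurement described above, written against raw Linux epoll rather than apr_pollset. It is not the code I am running: local socketpairs stand in for the memcache connections, and the names now_ms, timed_epoll_wait and probe_ready_sockets are made up for illustration. It registers n sockets for read, makes a few of them readable, and reports how long a single epoll_wait() took and how many sockets it returned.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define MAX_SOCKS 256

/* Monotonic clock reading in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

/* One epoll_wait() call: stores the ready count in *nready,
 * returns the time spent inside the call in milliseconds. */
static double timed_epoll_wait(int epfd, struct epoll_event *evs, int max,
                               int timeout_ms, int *nready)
{
    double start = now_ms();
    *nready = epoll_wait(epfd, evs, max, timeout_ms);
    return now_ms() - start;
}

/* Hypothetical harness (assumption: local socketpairs are a fair stand-in
 * for remote connections). Registers n sockets for EPOLLIN, writes one byte
 * to the first 'ready' of them, then times a single wait.
 * Returns elapsed milliseconds, or -1.0 if setup failed. */
static double probe_ready_sockets(int n, int ready, int timeout_ms, int *nready)
{
    static int fds[MAX_SOCKS][2];
    struct epoll_event evs[MAX_SOCKS];
    int made = 0, ok = 1;
    double elapsed = -1.0;

    assert(n > 0 && n <= MAX_SOCKS && ready >= 0 && ready <= n);

    int epfd = epoll_create(n);
    if (epfd < 0)
        return -1.0;

    for (int i = 0; i < n && ok; i++) {
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds[i]) != 0) {
            ok = 0;
            break;
        }
        made++;
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i][0] };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i][0], &ev) != 0)
            ok = 0;
        else if (i < ready && write(fds[i][1], "x", 1) != 1)
            ok = 0;
    }

    if (ok)
        elapsed = timed_epoll_wait(epfd, evs, n, timeout_ms, nready);

    for (int i = 0; i < made; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    close(epfd);
    return elapsed;
}
```

Calling probe_ready_sockets(140, 3, 1000, &nready) on a healthy kernel should come back almost immediately with all three ready sockets. A return that takes close to the full timeout and reports a single socket would match step b) above.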

Other notes:
- This is in a single threaded client, so there is no cross locking of
the linked lists from _add or _remove in the pollset.
- OS is RHEL 4 update 2.
- This is 99.9% reproducible in large-scale test and production
environments.

The most interesting aspect to me is that if I compile APR using poll()
instead of epoll() as the apr_pollset backend, the exact same code works
great, with no extra delay. (just pass apr_cv_epoll=no to your
./configure line).

I googled'^H^H^H^H^H^Hsearched around, and wasn't able to find mention
of a bug like this.

To me, the non-kernel programmer, it looks like epoll is only getting
triggered on the wakeup timer for the timeout, and not returning
instantly when it has found a socket available for read.  When it
finally does hit the timeout wakeup, it does notice that there is a
socket available to read, and returns it, rather than an actual timeout.
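
The theory above can be checked in isolation with something like the following sketch, again using raw epoll instead of APR. It assumes a forked child writing to a local socketpair is a reasonable stand-in for a server reply arriving while the client is already blocked in the wait; the names now_ms and probe_wakeup_latency are hypothetical. If epoll wakes on readiness, the elapsed time should be close to the child's delay; if it only wakes on the timeout timer, the elapsed time will be close to the timeout instead.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

/* Hypothetical probe: block in epoll_wait() with timeout_ms while a child
 * process writes one byte after delay_ms. Stores the ready count in *nready
 * and returns the elapsed milliseconds (measured from just before fork),
 * or -1.0 if setup failed. */
static double probe_wakeup_latency(int delay_ms, int timeout_ms, int *nready)
{
    int sv[2];
    struct epoll_event out;

    assert(delay_ms >= 0 && delay_ms < timeout_ms);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1.0;

    int epfd = epoll_create(1);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sv[0] };
    if (epfd < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) != 0) {
        if (epfd >= 0)
            close(epfd);
        close(sv[0]);
        close(sv[1]);
        return -1.0;
    }

    double start = now_ms();
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: wait, then make the parent's socket readable. */
        struct timespec d = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };
        close(sv[0]);
        nanosleep(&d, NULL);
        if (write(sv[1], "x", 1) != 1)
            _exit(1);
        _exit(0);
    }

    double elapsed = -1.0;
    if (pid > 0) {
        /* Parent: this call is already blocked when the data arrives. */
        *nready = epoll_wait(epfd, &out, 1, timeout_ms);
        elapsed = now_ms() - start;
        waitpid(pid, NULL, 0);
    }

    close(epfd);
    close(sv[0]);
    close(sv[1]);
    return elapsed;
}
```

With probe_wakeup_latency(50, 2000, &nready), a result near 50 ms with nready == 1 means the wait woke on readiness. A result near 2000 ms with nready == 1 would be the behaviour I am describing.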

For the short term, I am satisfied with disabling epoll on my builds of
APR.  I think we should consider disabling epoll by default on APR, if I
can isolate the bug to a kernel revision.

Any ideas or pointers to epoll bugs or fixes would be great.

