hc-dev mailing list archives

From "Idzerda, Edan" <Edan_Idze...@PremierInc.com>
Subject Re: How to tell if Async dispatcher thread is busy?
Date Wed, 14 Dec 2016 03:38:06 GMT

> On Dec 13, 2016, at 4:59 AM, Oleg Kalnichevski <olegk@apache.org> wrote:
> On Mon, 2016-12-12 at 21:15 +0000, Idzerda, Edan wrote:
>> Hello!  Our reverse proxy uses the Async Client pool to handle connections to backend
>> servers.  We've been tracking a problem for a while where we observe the initial TCP connection
>> is made, but no thread is available to handle the SSL setup before a 10 second timeout expires.
>> We get into trouble because some of our backend servers are very slow, and some of our clients
>> download very slowly.
>> I'm experimenting with a patch to AbstractMultiworkerIOReactor.addChannel() to determine
>> whether the next dispatcher thread is "busy."  My first try was to look at bufferedSessions
>> from the BaseIOReactor, and go through the list of dispatchers one time to see if I can find
>> a free one.
>>        int i = Math.abs(this.currentWorker++ % this.workerCount);
>>        for (int j = 0; j < this.workerCount; j++) {
>>            if (this.dispatchers[i].getSessionCount() == 0) {
>>                break;
>>            }
>>            i = Math.abs(this.currentWorker++ % this.workerCount);
>>        }
>>        this.dispatchers[i].addChannel(entry);
>> This seems to help us in MOST of the cases we see this issue in production, but there
>> still seem to be a small number of threads which collide.  I'm testing a different version
>> which looks at AbstractIOReactor "sessions" to determine thread busy state, but it never seems
>> to show more than "1" session if I look at the size after piling up slow connections on top
>> of each other.
>> I have two questions:
>>    Is there a better way to determine whether a thread is busy?
>>    Would you be willing to accept a patch to make the dispatchers array in AbstractMultiworkerIOReactor
>>    "protected" so I can implement my own ConnectingIOReactor that overrides addChannel() with
>>    my own thread selection model?
>> Thanks a lot for your help and for providing such a great library to the community!
>> - edan
> Hi Edan
> What I do not quite understand is why i/o dispatch threads get blocked
> for 10 seconds or longer. This sounds awfully suspicious.
> I could imagine exposing the list of i/o dispatchers to subclasses of
> AbstractMultiworkerIOReactor in 4.4.x branch but would rather prefer to
> keep it as a last resort.
> Oleg

Thanks.  I would prefer not to have to patch httpcore-nio like this if I could work out the
root cause.  Since I am still seeing connections fail to complete SSL setup within 10 seconds
with my first patch (above), I am now trying a new one that uses an AtomicInteger for currentWorker.
We are seeing far fewer connection problems with the patch, but there are still enough apparent
thread selection collisions that some requests fail.
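
For illustration, a minimal standalone sketch of round-robin selection driven by an
AtomicInteger might look like the following.  This is not the actual patch and not
httpcore-nio code: RoundRobinSelector and next() are hypothetical names, and the sketch
only shows how concurrent callers can each obtain a distinct, non-negative index.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration only; not part of httpcore-nio.
final class RoundRobinSelector {
    private final AtomicInteger counter = new AtomicInteger(0);
    private final int workerCount;

    RoundRobinSelector(int workerCount) {
        if (workerCount <= 0) {
            throw new IllegalArgumentException("workerCount must be positive");
        }
        this.workerCount = workerCount;
    }

    // Returns the next worker index in [0, workerCount).
    // getAndIncrement() is atomic, so two concurrent callers never read the
    // same counter value; floorMod keeps the result non-negative even after
    // the counter wraps past Integer.MAX_VALUE.
    int next() {
        return Math.floorMod(counter.getAndIncrement(), workerCount);
    }
}
```

Note that this only removes races on the counter itself.  It does not tell the caller
whether the selected worker is currently busy.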

The only way I have been able to reproduce this problem is by using an artificially rate-limited
connection (e.g., curl --limit-rate 1m) and downloading a relatively large file.  If I use a
small file, say 50K, I notice that the dispatcher threads do not get stuck.  I can download
more files than I have worker threads, and the size of AbstractIOReactor’s “sessions” set
stays at 0.  With a larger file, like 500K, the sessions size goes to 1, and I can only download
the same number of files as I have worker threads.

Does this make any sense to you?  Is it possible the higher-level proxy library is hanging
on to the HttpResponse’s Entity too long?  I see they call HttpEntity.getContent() and create
an InputStream out of it… But why would that make a worker thread become non-responsive
until it finishes?  I see a note on IOEventDispatch suggesting that “all methods of this
interface are executed on the dispatch thread of the I/O reactor … it is important that
processing that takes place in the event methods will not block the dispatch thread for too
long, as the I/O reactor will be unable to react to other events.”
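
The concern in that note can be illustrated without any HttpCore classes.  The following
is a minimal sketch, using only java.util.concurrent, of a callback that hands slow work
to a separate pool and returns at once, so the calling (dispatch-like) thread stays free.
OffloadSketch, onEvent(), shutdown() and demo() are hypothetical names, not HttpCore APIs,
and the pool size of 4 is arbitrary.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; none of these names come from HttpCore.
final class OffloadSketch {
    // Separate pool for slow work (size chosen arbitrarily for the sketch).
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    // Stand-in for an event callback invoked on a dispatch thread.
    // It only queues the work and returns immediately.
    void onEvent(Runnable slowWork) {
        workers.execute(slowWork);
    }

    // Stops accepting work and waits up to 5 seconds for queued tasks.
    boolean shutdown() {
        workers.shutdown();
        try {
            return workers.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Returns true if onEvent() returned while the slow task was still
    // blocked, and the task then completed once it was released.
    static boolean demo() {
        OffloadSketch sketch = new OffloadSketch();
        CountDownLatch release = new CountDownLatch(1);
        AtomicBoolean finished = new AtomicBoolean(false);

        sketch.onEvent(() -> {
            try {
                release.await();      // simulates a slow consumer
                finished.set(true);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // The task cannot finish before release.countDown(), so reaching
        // this line shows the caller was not blocked by it.
        boolean returnedEarly = !finished.get();
        release.countDown();
        boolean drained = sketch.shutdown();
        return returnedEarly && drained && finished.get();
    }
}
```

If the same slow task were run directly inside onEvent(), the caller would be stuck in
release.await() and could not service anything else in the meantime.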

Is that worth pursuing?  Any suggestions on how to debug this would be appreciated!

- edan
