hc-httpclient-users mailing list archives

From Oleg Kalnichevski <ol...@apache.org>
Subject HttpClient performance with multiple threads; Re: AbstractNIOConnPool memory leak?
Date Sun, 06 Jan 2013 12:55:56 GMT
On Sat, 2013-01-05 at 15:56 -0800, Ken Krugler wrote:
> On Jan 5, 2013, at 3:31pm, vigna wrote:
> > On 5 Jan 2013, at 3:10 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
> > 
> >> So on a large box (e.g. 24 more powerful cores) I could see using upward
> >> of 10K threads being the optimal number.
> > 
> > We are working to make 20-30K connections work on 64 cores.
> > 
> >> Just FYI about two years ago we were using big servers with lots of
> >> threads during a large-scale web crawl, and we did run into interesting
> >> bottlenecks in HttpClient 4.0.1 (?) with lots of simultaneous threads.
> >> I haven't had to revisit those issues with a recent release, so maybe
> >> those have been resolved.
> > 
> > 
> > Can you elaborate on that? I guess it would be priceless knowledge :).
> 1. CookieStore access
> > For example, during a Bixo crawl with 300 threads, I was doing regular
> > thread dumps and inspecting the results. A very high percentage
> > (typically > 1/3) were blocked while waiting to get access to the cookie
> > store. By default there's only one of these per HttpClient.
> > 
> > This one was fairly easy to work around, by creating a cookie store in
> > the local context for each request:
> > 
> >            CookieStore cookieStore = new BasicCookieStore();
> >            localContext.setAttribute(ClientContext.COOKIE_STORE, cookieStore);
> 2. Scheme registry
> > But I've run into a few other synchronized method/data bottlenecks,
> > which I'm still working through. For example, at irregular intervals the
> > bulk of my fetcher threads are blocked on getting the scheme registry.
> I believe this one has been fixed via the patch for
> https://issues.apache.org/jira/browse/HTTPCLIENT-903, and is in the
> current release of HttpClient.


You might want to have a look at the latest code in SVN trunk (to be
released as 4.3). Several classes such as the scheme registry that
previously had to be synchronized in order to ensure thread safety have
been replaced with immutable equivalents. There is also now a way to
create HttpClient in a minimal configuration without authentication,
state management (cookies), proxy support and other non-essential
functions. These functions are not merely disabled but physically
removed from the processing pipeline, which should result in somewhat
better performance in high thread contention scenarios, as the only
synchronization point involved in request execution would be the lock of
the connection pool. Minimal HttpClient may be particularly useful for
anonymous web crawling when authentication and state management are not
required.
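
To illustrate why an immutable replacement removes a synchronization
point, here is a minimal sketch in plain Java. It is not the HttpClient
code: the class ImmutablePortRegistry and its methods are hypothetical
names made up for this example. All state is fixed at construction, so
any number of threads can look up a scheme without taking a lock, and
adding a scheme yields a new instance instead of mutating a shared one.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the HttpClient scheme registry.
final class ImmutablePortRegistry {

    // Never modified after construction, so reads need no lock.
    private final Map<String, Integer> defaultPorts;

    ImmutablePortRegistry(Map<String, Integer> ports) {
        // Defensive copy with normalized keys; later changes to the
        // caller's map cannot affect this instance.
        Map<String, Integer> copy = new HashMap<String, Integer>();
        for (Map.Entry<String, Integer> e : ports.entrySet()) {
            copy.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        this.defaultPorts = Collections.unmodifiableMap(copy);
    }

    // Lock-free lookup; returns -1 for an unknown scheme.
    int defaultPort(String scheme) {
        Integer port = defaultPorts.get(scheme.toLowerCase(Locale.ROOT));
        return port != null ? port : -1;
    }

    // "Registering" a scheme returns a new registry and leaves this one as is.
    ImmutablePortRegistry with(String scheme, int port) {
        Map<String, Integer> copy = new HashMap<String, Integer>(defaultPorts);
        copy.put(scheme, port);
        return new ImmutablePortRegistry(copy);
    }
}
```

The trade-off in a design like this is that every change allocates a
new object, which is acceptable when the registry is configured once
and then only read by the worker threads.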

> 3. Global lock on connection pool
> Oleg had written:
> > Yes, your observation is correct. The problem is that the connection
> > pool is guarded by a global lock. Naturally if you have 400 threads
> > trying to obtain a connection at about the same time all of them end up
> > contending for one lock. The problem is that I can't think of a
> > different way to ensure the max limits (per route and total) are
> > guaranteed not to be exceeded. If anyone can think of a better algorithm
> > please do let me know. What might be a possibility is creating a more
> > lenient and less prone to lock contention issues implementation that may
> > under stress occasionally allocate a few more connections than the max
> > limits.
> I don't know if this has been resolved. My work-around from a few years
> ago was to rely on having multiple Hadoop reducers running on the server
> (each in their own JVM), where I could then limit each JVM to at most 300
> connections.

I experimented with the idea of a lock-less (unlimited) connection manager
but in my tests it did not perform any better than the standard
connection manager.
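
For reference, the more lenient approach described in the quoted text
above could be sketched roughly as follows. This is only an illustration
in plain Java and not the experimental connection manager attached to
this message; the class LenientLimiter and its methods are hypothetical.
It tracks leases with atomic counters instead of a global lock. Because
the limit check and the increments are separate steps, threads racing
through tryAcquire may occasionally push a counter past its maximum.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not the connection manager attached to this message.
final class LenientLimiter {

    private final int maxTotal;
    private final int maxPerRoute;
    private final AtomicInteger total = new AtomicInteger();
    private final ConcurrentMap<String, AtomicInteger> perRoute =
            new ConcurrentHashMap<String, AtomicInteger>();

    LenientLimiter(int maxTotal, int maxPerRoute) {
        this.maxTotal = maxTotal;
        this.maxPerRoute = maxPerRoute;
    }

    // Returns the counter for a route, creating it on first use.
    private AtomicInteger counter(String route) {
        AtomicInteger c = perRoute.get(route);
        if (c == null) {
            AtomicInteger fresh = new AtomicInteger();
            c = perRoute.putIfAbsent(route, fresh);
            if (c == null) {
                c = fresh;
            }
        }
        return c;
    }

    // No lock is taken. Check and increment are not atomic together, so
    // concurrent callers can briefly exceed either limit.
    boolean tryAcquire(String route) {
        AtomicInteger routeCount = counter(route);
        if (total.get() >= maxTotal || routeCount.get() >= maxPerRoute) {
            return false;
        }
        routeCount.incrementAndGet();
        total.incrementAndGet();
        return true;
    }

    // Must be called exactly once for each successful tryAcquire.
    void release(String route) {
        counter(route).decrementAndGet();
        total.decrementAndGet();
    }

    int leased() {
        return total.get();
    }
}
```

A caller would pair each successful tryAcquire(route) with one
release(route) once the connection is returned. The sketch leaves out
waiting for a free slot and connection reuse, which a real pool also
has to handle.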

I am attaching the source code of my experimental connection manager.
Feel free to improve on it and see if it produces better results for your
particular application.


> HTH,
> -- Ken
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
