hc-httpclient-users mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: AbstractNIOConnPool memory leak?
Date Sat, 05 Jan 2013 23:44:03 GMT

On Jan 5, 2013, at 3:36pm, Oleg Kalnichevski wrote:

> On Sat, 2013-01-05 at 22:11 +0000, sebb wrote:
>> On 5 January 2013 21:33, vigna <vigna@di.unimi.it> wrote:
>>>> But why would you want a web crawler to have 10-20K simultaneously
>>>> opened connections in the first place?
>>> 
>>> (I thought I answered this, but it's not on the archive. Boh.)
>>> 
>>> Having a few thousand connections open is the only way to retrieve data
>>> while respecting politeness (e.g., not banging the same site too often).
>> 
>> Huh?
>> There are surely other ways to achieve that goal.
>> 
> 
> I could not agree more. I personally think that closing idle connections
> and letting the server reclaim the resources associated with them
> (potentially enabling the server to serve other clients) would be more
> 'polite'. It is cheaper for both the client and the server to close
> connections more frequently than keeping them alive just in case.

Just to clarify, for our web crawl we were using a connection pool and letting idle connections
be reclaimed.

But we were also fetching small batches of URLs (e.g. 5 at a time) when hitting the same server,
keeping the connection open between requests. This was an attempt to balance the cost to the
target server of establishing a new connection against being polite. For typical web sites this
feels like a win, but low-traffic sites whose complex pages are generated by JSP code (for example)
could be unhappy. I know that Heritrix varies its crawl delay based on the response time of the
server, which could be a better way to constrain the number of keep-alive requests.
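
A minimal sketch of the response-time idea, for illustration only. This is not
Heritrix's actual code: the class name AdaptiveCrawlDelay, the multiplier, and
the min/max bounds are all assumptions. The delay before the next request to a
host is the last response time scaled by a factor, then clamped to a range, so
a slow server automatically gets hit less often.

```java
/**
 * Hypothetical helper (not part of HttpClient or Heritrix): computes a
 * politeness delay that grows with the server's last response time.
 */
final class AdaptiveCrawlDelay {
    private final double delayFactor; // assumed multiplier applied to the response time
    private final long minDelayMs;    // never wait less than this between requests
    private final long maxDelayMs;    // never wait more than this between requests

    AdaptiveCrawlDelay(double delayFactor, long minDelayMs, long maxDelayMs) {
        if (delayFactor < 0 || minDelayMs < 0 || maxDelayMs < minDelayMs) {
            throw new IllegalArgumentException("invalid delay settings");
        }
        this.delayFactor = delayFactor;
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Returns how long to wait before the next request to the same host. */
    long nextDelayMs(long lastResponseMs) {
        // Treat negative measurements as zero, scale, then clamp to [min, max].
        long scaled = Math.round(delayFactor * Math.max(0L, lastResponseMs));
        return Math.max(minDelayMs, Math.min(maxDelayMs, scaled));
    }
}
```

With a factor of 5 and bounds of 1 to 30 seconds, a server that answers in 2
seconds would be given a 10 second pause, while a fast server would still get
the 1 second minimum.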

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



