hc-httpclient-users mailing list archives

From: Oleg Kalnichevski <ol...@apache.org>
Subject: Re: OOM problem
Date: Tue, 11 Feb 2014 13:39:19 GMT
On Mon, 2014-02-10 at 20:57 -0800, Ken Krugler wrote:
> If you're crawling web pages, you need to have a limit to the amount of data any page returns.
> 
> Otherwise you'll eventually run into a site that returns an unbounded amount of data, which will kill your JVM.
> 
> See SimpleHttpFetcher in Bixo for an example of one way to do this type of limiting (though not optimal).
> 
> -- Ken
> 
> 
> On Feb 10, 2014, at 8:07pm, Li Li <fancyerii@gmail.com> wrote:
> 
> > I am using HttpClient 4.3 to crawl webpages.
> > I start 200 threads and a PoolingHttpClientConnectionManager with
> > totalMax 1000 and perHostMax 5.
> > I give the JVM 2GB of memory, and one thread throws an exception (the
> > others are still running; this thread is dead):
> > 
> > Exception in thread "Thread-156" java.lang.OutOfMemoryError: Java heap space
> >        at org.apache.http.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:56)
> >        at org.apache.http.util.EntityUtils.toByteArray(EntityUtils.java:133)

Moreover, buffering response content in memory (either as byte array or
string) sounds like a really bad idea to me.

Oleg
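
A minimal sketch of the per-page cap Ken describes (the 1 MB budget, the class
name, and the chunked read loop are illustrative assumptions, not code from
Bixo's SimpleHttpFetcher): read the entity stream in fixed-size chunks and stop
once the budget is spent, so no single response can allocate an unbounded
buffer. It still accumulates up to the cap in memory, which is where the
streaming approach further below comes in.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;

// Illustrative handler: caps how much of any one response is ever read.
public class BoundedResponseHandler implements ResponseHandler<byte[]> {

    // Assumed budget for this sketch; tune it to your crawl.
    private static final int MAX_BYTES = 1024 * 1024;

    public byte[] handleResponse(final HttpResponse response)
            throws ClientProtocolException, IOException {
        int status = response.getStatusLine().getStatusCode();
        if (status < 200 || status >= 300) {
            throw new ClientProtocolException("Unexpected response status: " + status);
        }
        HttpEntity entity = response.getEntity();
        if (entity == null) {
            return null;
        }
        InputStream in = entity.getContent();
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int remaining = MAX_BYTES;
            int n;
            // Read in chunks; once the budget is spent, return the truncated
            // content instead of growing the buffer without bound.
            while (remaining > 0
                    && (n = in.read(buf, 0, Math.min(buf.length, remaining))) != -1) {
                out.write(buf, 0, n);
                remaining -= n;
            }
            return out.toByteArray();
        } finally {
            // Closing the stream before it is fully consumed means the pooled
            // connection is discarded rather than reused; for oversized pages
            // that is usually an acceptable trade-off.
            in.close();
        }
    }
}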


> >        at com.founder.httpclientfetcher.HttpClientFetcher$3.handleResponse(HttpClientFetcher.java:221)
> >        at com.founder.httpclientfetcher.HttpClientFetcher$3.handleResponse(HttpClientFetcher.java:211)
> >        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:218)
> >        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:160)
> >        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:136)
> >        at com.founder.httpclientfetcher.HttpClientFetcher.httpGet(HttpClientFetcher.java:233)
> >        at com.founder.vcfetcher.CrawlWorker.getContent(CrawlWorker.java:198)
> >        at com.founder.vcfetcher.CrawlWorker.doWork(CrawlWorker.java:134)
> >        at com.founder.vcfetcher.CrawlWorker.run(CrawlWorker.java:231)
> > 
> > Does it mean my code has some memory leak problem?
> > 
> > My code:
> > public String httpGet(String url) throws Exception {
> >     if (!isValid)
> >         throw new RuntimeException("not valid now, you should init first");
> >     HttpGet httpget = new HttpGet(url);
> > 
> >     // Create a custom response handler
> >     ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
> > 
> >         public String handleResponse(final HttpResponse response)
> >                 throws ClientProtocolException, IOException {
> >             int status = response.getStatusLine().getStatusCode();
> >             if (status >= 200 && status < 300) {
> >                 HttpEntity entity = response.getEntity();
> >                 if (entity == null)
> >                     return null;
> > 
> >                 byte[] bytes = EntityUtils.toByteArray(entity);
> >                 String charSet = CharsetDetector.getCharset(bytes);
> > 
> >                 return new String(bytes, charSet);
> >             } else {
> >                 throw new ClientProtocolException(
> >                         "Unexpected response status: " + status);
> >             }
> >         }
> > 
> >     };
> > 
> >     String responseBody = client.execute(httpget, responseHandler);
> >     return responseBody;
> > }
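
Along the lines of Oleg's remark above, a hedged sketch of the streaming
alternative (the default client, the example URL, and the temp-file destination
are placeholders, not details from this thread): write the entity straight to
disk with HttpEntity#writeTo, so the heap only ever holds one small copy buffer,
and run charset detection over the file afterwards.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class StreamToDiskFetch {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient client = HttpClients.createDefault();
        HttpGet httpget = new HttpGet("http://example.com/"); // placeholder URL
        CloseableHttpResponse response = client.execute(httpget);
        try {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                File page = File.createTempFile("page-", ".html"); // placeholder destination
                OutputStream os = new BufferedOutputStream(new FileOutputStream(page));
                try {
                    // Copies the content stream in chunks; the full page is
                    // never materialised as a single byte[] or String.
                    entity.writeTo(os);
                } finally {
                    os.close();
                }
                // Ensure the entity is fully consumed so the connection can be reused.
                EntityUtils.consume(entity);
            }
        } finally {
            response.close();
        }
        client.close();
    }
}

Combining this with the byte cap sketched earlier (for example, a wrapper output
stream that aborts once a limit is exceeded) would give both protections at once.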
> > 
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr



---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
For additional commands, e-mail: httpclient-users-help@hc.apache.org

