hc-httpclient-users mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: OOM problem
Date Tue, 11 Feb 2014 04:57:58 GMT
If you're crawling web pages, you need to cap the amount of data you'll read from any single page.

Otherwise you'll eventually run into a site that returns an unbounded amount of data, which
will kill your JVM.

See SimpleHttpFetcher in Bixo for an example of one way to do this kind of limiting (though
it's not optimal).
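
As a rough illustration of that kind of limiting (a minimal sketch, not what SimpleHttpFetcher actually does), the body can be read through a helper that stops after a fixed number of bytes instead of calling EntityUtils.toByteArray(). The class name BoundedReader and the method readAtMost are made up for this example:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of HttpClient or Bixo): reads a stream
// up to a hard byte limit and silently truncates anything beyond it.
final class BoundedReader {
    private BoundedReader() {}

    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be >= 0");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(Math.min(maxBytes, 8192));
        byte[] buf = new byte[8192];
        int remaining = maxBytes;
        while (remaining > 0) {
            // Never ask for more than the remaining budget.
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                break; // end of stream reached before the limit
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

In the handler from the original question, this would presumably replace the EntityUtils.toByteArray(entity) call with something like BoundedReader.readAtMost(entity.getContent(), maxBytes), where maxBytes is whatever per-page cap fits your heap and thread count. If the page was truncated, you'd likely also want to abort the request (for example httpget.abort()) rather than let the rest of the body be drained.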

-- Ken


On Feb 10, 2014, at 8:07pm, Li Li <fancyerii@gmail.com> wrote:

> I am using httpclient 4.3 to crawl webpages.
> I start 200 threads and use a PoolingHttpClientConnectionManager with
> totalMax 1000 and perHostMax 5.
> I give Java 2GB of memory, and one thread throws an exception (the others
> are still running; this thread is dead):
> 
> Exception in thread "Thread-156" java.lang.OutOfMemoryError: Java heap space
>        at org.apache.http.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:56)
>        at org.apache.http.util.EntityUtils.toByteArray(EntityUtils.java:133)
>        at com.founder.httpclientfetcher.HttpClientFetcher$3.handleResponse(HttpClientFetcher.java:221)
>        at com.founder.httpclientfetcher.HttpClientFetcher$3.handleResponse(HttpClientFetcher.java:211)
>        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:218)
>        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:160)
>        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:136)
>        at com.founder.httpclientfetcher.HttpClientFetcher.httpGet(HttpClientFetcher.java:233)
>        at com.founder.vcfetcher.CrawlWorker.getContent(CrawlWorker.java:198)
>        at com.founder.vcfetcher.CrawlWorker.doWork(CrawlWorker.java:134)
>        at com.founder.vcfetcher.CrawlWorker.run(CrawlWorker.java:231)
> 
> Does this mean my code has a memory leak problem?
> 
> My code:
> public String httpGet(String url) throws Exception {
>     if (!isValid)
>         throw new RuntimeException("not valid now, you should init first");
>     HttpGet httpget = new HttpGet(url);
> 
>     // Create a custom response handler
>     ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
> 
>         public String handleResponse(final HttpResponse response)
>                 throws ClientProtocolException, IOException {
>             int status = response.getStatusLine().getStatusCode();
>             if (status >= 200 && status < 300) {
>                 HttpEntity entity = response.getEntity();
>                 if (entity == null)
>                     return null;
> 
>                 byte[] bytes = EntityUtils.toByteArray(entity);
>                 String charSet = CharsetDetector.getCharset(bytes);
> 
>                 return new String(bytes, charSet);
>             } else {
>                 throw new ClientProtocolException(
>                         "Unexpected response status: " + status);
>             }
>         }
> 
>     };
> 
>     String responseBody = client.execute(httpget, responseHandler);
>     return responseBody;
> }
> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






