hc-httpclient-users mailing list archives

From Oleg Kalnichevski <ol...@apache.org>
Subject Re: Memory leak using httpclient
Date Tue, 14 Mar 2006 10:57:16 GMT
On Tue, 2006-03-14 at 01:52 -0500, James Ostheimer wrote: 
> Hi-
> 
> I am using httpclient in a multi-threaded webcrawler application.  I am using the MultiThreadedHttpConnectionManager
> in conjunction with 300 threads that download pages from various sites.
> 
> Problem is that I am running out of memory shortly after the process begins.  I used
> JProfiler to analyze the memory stacks and it points to:
>   a.. 76.2% - 233,587 kB - 6,626 alloc. org.apache.commons.httpclient.HttpMethod.getResponseBodyAsString

> as the culprit (at most there should be a little over 300 allocations, as there are 300
> threads operating at once).  Other relevant information: I am on a Windows XP Pro platform
> using the Sun JRE that came with jdk1.5.0_06.  I am using commons-httpclient-3.0.jar.
> 

James,

There's no memory leak in HttpClient. Just do not use the
HttpMethod#getResponseBodyAsString() method, which is not intended for
retrieving response entities of arbitrary length: it buffers the
entire response content in memory in order to convert it to a
String. If your crawler hits a site that generates an endless stream of
garbage, the JVM is bound to run out of memory.

Use getResponseBodyAsStream() instead.
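[Editor's note: the advice above can be sketched as follows. This is a minimal illustration, not part of the HttpClient API: the class name BoundedBodyReader, the helper readBounded, and the 5-character cap in main are all hypothetical. In the real crawler the InputStream would come from method.getResponseBodyAsStream(), and the charset from method.getResponseCharSet().]

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class BoundedBodyReader {

    // Reads at most maxChars characters from the stream and returns them
    // as a String, so a runaway response cannot exhaust the heap. In the
    // crawler, pass the stream from method.getResponseBodyAsStream() here
    // instead of calling getResponseBodyAsString().
    static String readBounded(InputStream in, int maxChars, String charset)
            throws IOException {
        if (in == null) {
            return null; // getResponseBodyAsStream() returns null when there is no body
        }
        Reader reader = new InputStreamReader(in, charset);
        StringBuffer sb = new StringBuffer();
        char[] buf = new char[4096];
        int read;
        while ((read = reader.read(buf)) != -1) {
            int remaining = maxChars - sb.length();
            sb.append(buf, 0, Math.min(read, remaining));
            if (sb.length() >= maxChars) {
                break; // truncate instead of buffering an endless stream
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body stream; a real stream from
        // method.getResponseBodyAsStream() plugs in the same way.
        InputStream body =
                new ByteArrayInputStream("hello world".getBytes("ISO-8859-1"));
        System.out.println(readBounded(body, 5, "ISO-8859-1"));
    }
}
```

With the cap in place, a page that streams endless garbage is simply truncated at the limit rather than accumulating on the heap; the connection is still released in the finally block as before.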

Hope this helps

Oleg

> Here is the code where I initialize the HttpClient:
> 
> private HttpClient httpClient; 
>  
>  public CrawlerControllerThread(QueueThread qt, MessageReceiver receiver, int maxThreads, String flag,
>    boolean filter, String filterString, String dbType) {
>   this.qt = qt;
>   this.receiver = receiver;
>   this.maxThreads = maxThreads;
>   this.flag = flag;
>   this.filter = filter;
>   this.filterString = filterString;
>   this.dbType = dbType;
>   threads = new ArrayList();
>   lastStatus = new HashMap();
>   
>   HttpConnectionManagerParams htcmp = new HttpConnectionManagerParams();
>   htcmp.setMaxTotalConnections(maxThreads);
>   htcmp.setDefaultMaxConnectionsPerHost(10);
>   htcmp.setSoTimeout(5000);
>   MultiThreadedHttpConnectionManager mtcm = new MultiThreadedHttpConnectionManager();
>   mtcm.setParams(htcmp);
>   httpClient = new HttpClient(mtcm);
>   
>   
>  }
> 
> The client reference to httpClient is then passed to all the crawling threads, where it
> is used as follows:
> 
> private String getPageApache(URL pageURL, ArrayList unProcessed) {
>   SaveURL saveURL = new SaveURL();
>   HttpMethod method = null;
>   HttpURLConnection urlConnection = null;
>   String rawPage = "";
>   try {
>    method = new GetMethod(pageURL.toExternalForm());
>    method.setFollowRedirects(true);
>    method.setRequestHeader("Content-type", "text/html");
>    int statusCode = httpClient.executeMethod(method);
> //   urlConnection = new HttpURLConnection(method,
> //     pageURL);
>    logger.debug("Requesting: "+pageURL.toExternalForm());
> 
>    
>    rawPage = method.getResponseBodyAsString();
>    //rawPage = saveURL.getURL(urlConnection);
>    if(rawPage == null){
>     unProcessed.add(pageURL);
>    } 
>    return rawPage;
>   } catch (IllegalArgumentException e) {
>    //e.printStackTrace();
>    
>   } 
>   catch (HttpException e) {
>    
>    //e.printStackTrace();
>   } catch (IOException e) {
>    unProcessed.add(pageURL);
>    //e.printStackTrace();
>   }finally {
>    if(method != null) {
>     method.releaseConnection();
>    }
>    try {
>     if(urlConnection != null) {
>      if(urlConnection.getInputStream() != null) {
>       urlConnection.getInputStream().close();
>      }
>     }
>    } catch (IOException e) {
>     // TODO Auto-generated catch block
>     e.printStackTrace();
>    }
>    urlConnection = null;
>    method = null;
>   }
>   return null;
>  }
> 
> As you can see, I release the connection in the finally block, so that should not
> be a problem. After getPageApache runs, the returned page string is processed
> and then set to null for garbage collection. I have been experimenting with this, closing streams,
> using HttpURLConnection instead of GetMethod, and I cannot find the answer.  Indeed, it
> seems the answer does not lie in my code.
> 
> I greatly appreciate any help that anyone can give me; I am at the end of my rope with
> this one.
> 
> James


---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: httpclient-user-help@jakarta.apache.org

