hc-httpclient-users mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: getting only the header
Date Tue, 26 Jan 2010 13:42:08 GMT

On Jan 26, 2010, at 3:54am, Claudio Martella wrote:

> As I mentioned in my previous post, I'm using HttpClient for a
> web crawler I'm writing. At the moment I'm doing something like this:
>
>
>    while (toVisit.size() > 0) {
>        client.execute(method);
>        // getContentType() does method.getResponseHeader("Content-Type").getValue()
>        String mime = getContentType(method);
>
>        if (supportedMimes.contains(mime)) {
>            handle(method.getResponseBody());
>        } else {
>            continue;
>        }
>    }
>
> The problem is that I can see the crawler spending a lot of time
> processing URLs that are going to be ignored, so I guess it's
> downloading the whole stream before ignoring it. Is there a way I can
> download just the headers, check the content type, and only then
> download the stream (at the time of getResponseBody())?

See the code I'd previously referenced for an example of exactly that.

http://github.com/bixo/bixo/blob/master/src/main/java/bixo/fetcher/http/SimpleHttpFetcher.java

Make sure you abort the request if you skip reading the response body.
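
For readers who can't get to the linked file, here is a rough sketch of the "headers first, then decide" pattern. It is not the Bixo code and it does not use HttpClient: it uses the JDK's own HttpURLConnection as a stand-in so that it compiles without extra jars. The class name HeaderFirstFetch, the helper baseMimeType(), and the contents of SUPPORTED_MIMES are all made up for illustration. In HttpClient 4.x the step that matters is the same: check the header on the response first, and call abort() on the request object when you decide not to read the body.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class HeaderFirstFetch {

    // Hypothetical whitelist, standing in for supportedMimes in the question.
    static final Set<String> SUPPORTED_MIMES =
            new HashSet<>(Arrays.asList("text/html", "text/plain"));

    // Strips parameters such as "; charset=UTF-8" so the value can be
    // compared against the whitelist. Returns "" when the header is missing.
    public static String baseMimeType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the body for supported types, or null when the response is skipped.
    public static byte[] fetchIfSupported(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            // Sends the request and parses the headers; it does not wait
            // for the whole body to arrive.
            String mime = baseMimeType(conn.getContentType());
            if (!SUPPORTED_MIMES.contains(mime)) {
                return null; // the finally block below gives up the connection
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            }
        } finally {
            // Stand-in for aborting the request: drop the connection so an
            // unread body is not downloaded.
            conn.disconnect();
        }
    }
}
```

Note that servers often send something like "text/html; charset=UTF-8", so comparing the raw Content-Type value against a set of plain MIME types can reject pages you actually want; that is why the sketch strips the parameters before the lookup.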

-- Ken


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g