hc-httpclient-users mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: getting only the header
Date Tue, 26 Jan 2010 19:37:50 GMT

On Jan 26, 2010, at 6:36am, Claudio Martella wrote:

> I see that you do:
>
> if (mimeTypes != null) {
>     String mimeType = HttpUtils.getMimeTypeFromContentType(contentType);
>     if (!mimeTypes.contains(mimeType)) {
>         throw new AbortedFetchException(url, "Invalid mime-type: " + mimeType,
>                 AbortedFetchReason.INVALID_MIMETYPE);
>     }
> }
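
For readers without the bixo source at hand, the MIME-type check quoted above can be sketched roughly as follows. This is a minimal illustration, not bixo's code: the class MimeTypeFilter and both of its methods are hypothetical names, and the parsing rule (drop everything after the first ';', trim, lower-case) is an assumption about what a helper like getMimeTypeFromContentType would need to do.

```java
import java.util.Locale;
import java.util.Set;

public class MimeTypeFilter {

    // Hypothetical helper (not bixo's implementation): returns the bare media
    // type from a Content-Type header value by dropping any parameters such
    // as "; charset=UTF-8".
    public static String mimeTypeFromContentType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String mimeType = (semicolon >= 0)
                ? contentType.substring(0, semicolon)
                : contentType;
        return mimeType.trim().toLowerCase(Locale.ROOT);
    }

    // A null set means "no filtering", mirroring the null check in the
    // quoted snippet.
    public static boolean isAccepted(Set<String> mimeTypes, String contentType) {
        return mimeTypes == null
                || mimeTypes.contains(mimeTypeFromContentType(contentType));
    }

    public static void main(String[] args) {
        Set<String> accepted = Set.of("text/html", "application/xhtml+xml");
        System.out.println(isAccepted(accepted, "text/html; charset=UTF-8"));
        System.out.println(isAccepted(accepted, "image/png"));
    }
}
```

Comparing the raw header value against the accepted set, without stripping parameters first, would reject a response such as "text/html; charset=UTF-8" even though its media type is accepted.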
>
> My question is then: what is done on the caller side, when handling the
> AbortedFetchException, to deal with the httpclient?

Nothing. The finally {} block around the code that throws that exception
calls safeAbort with the HttpGet object, which in turn calls
request.abort().
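
A rough sketch of that abort-in-finally pattern, for illustration only. To keep it runnable without the HttpClient jars, the Abortable interface stands in for the single HttpGet method that matters here (abort()), and SkippedFetchException stands in for AbortedFetchException. The needAbort flag and the exact shape of safeAbort are assumptions, not a copy of the bixo code.

```java
import java.util.Set;
import java.util.function.Supplier;

public class AbortOnSkip {

    // Hypothetical stand-in for the one HttpGet method this sketch needs.
    public interface Abortable {
        void abort();
    }

    // Hypothetical stand-in for AbortedFetchException.
    public static class SkippedFetchException extends RuntimeException {
        public SkippedFetchException(String message) {
            super(message);
        }
    }

    // Assumed shape of a safeAbort helper: abort only when asked to, and
    // never let a failure inside abort() hide the original exception.
    static void safeAbort(boolean needAbort, Abortable request) {
        if (needAbort && request != null) {
            try {
                request.abort();
            } catch (RuntimeException e) {
                // Deliberately ignored: aborting is best effort.
            }
        }
    }

    // Checks the MIME type before touching the body. If the check throws,
    // the finally block aborts the request, so the caller has nothing to
    // clean up.
    public static String fetch(Abortable request, String mimeType,
                               Set<String> accepted, Supplier<String> body) {
        boolean needAbort = true;
        try {
            if (accepted != null && !accepted.contains(mimeType)) {
                throw new SkippedFetchException("Invalid mime-type: " + mimeType);
            }
            String content = body.get(); // the body is read only after the check
            needAbort = false;           // fully consumed, so no abort is needed
            return content;
        } finally {
            safeAbort(needAbort, request);
        }
    }

    public static void main(String[] args) {
        Abortable request = () -> System.out.println("request aborted");
        try {
            fetch(request, "image/png", Set.of("text/html"), () -> "<html/>");
        } catch (SkippedFetchException e) {
            System.out.println("skipped: " + e.getMessage());
        }
    }
}
```

Because the cleanup lives in the finally block, the caller only has to decide what to do with the exception itself (log it, record the URL as skipped, and so on).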

-- Ken


>
>
>
>
> Ken Krugler wrote:
>>
>> On Jan 26, 2010, at 3:54am, Claudio Martella wrote:
>>
>>> As I mentioned in the previous post, I'm using httpclient for a
>>> web crawler I'm writing. At the moment I'm doing something like this:
>>>
>>>
>>>   while (toVisit.size() > 0) {
>>>       client.execute(method);
>>>       // getContentType does
>>>       // method.getResponseHeader("Content-Type").getValue()
>>>       String mime = getContentType(method);
>>>
>>>       if (supportedMimes.contains(mime)) {
>>>           handle(method.getResponseBody());
>>>       } else {
>>>           continue;
>>>       }
>>>   }
>>>
>>> The problem is that I can see the crawler hanging for a long time
>>> while processing URLs that are going to be ignored, so I guess it's
>>> downloading the whole stream before ignoring it. Is there a way I can
>>> download just the header, check the content type, and only then
>>> download the stream (at the time of getResponseBody())?
>>
>> See the code I'd previously referenced for an example of exactly  
>> that.
>>
>> http://github.com/bixo/bixo/blob/master/src/main/java/bixo/fetcher/http/SimpleHttpFetcher.java
>>
>>
>> Make sure you abort the request if you skip getting the response.
>>
>> -- Ken
>>
>>
>> --------------------------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
>>
>>
>>
>>
>>
>
>
> -- 
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax  +39 0471 068 129
> claudio.martella@tis.bz.it http://www.tis.bz.it
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
> For additional commands, e-mail: httpclient-users-help@hc.apache.org
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






