hc-httpclient-users mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Header and Content parsing and saving as html page
Date Fri, 11 Feb 2011 14:00:29 GMT
Normally you'd let HttpClient handle decoding, chunked responses, etc.  
Then what you save is the raw content (as an array of bytes) and the  
response headers.
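
A minimal sketch of that saving step, using only the JDK. It assumes the headers and the already-decoded body bytes have been obtained from HttpClient elsewhere; the class name SavedResponse, the method names, and the ".headers" / ".body" file layout are hypothetical choices for illustration, not anything HttpClient provides.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: stores one fetched response as two files. */
final class SavedResponse {

    /** Writes name.headers (one "Name: value" per line) and name.body (bytes as-is). */
    static void save(Path dir, String name, Map<String, String> headers, byte[] body) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> h : headers.entrySet()) {
            lines.add(h.getKey() + ": " + h.getValue());
        }
        try {
            Files.createDirectories(dir);
            // Header text is written as ISO-8859-1, the historical default for HTTP headers.
            Files.write(dir.resolve(name + ".headers"), lines, StandardCharsets.ISO_8859_1);
            // No charset conversion here: the body stays a byte array until the off-line step.
            Files.write(dir.resolve(name + ".body"), body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the header lines back in the order they were written. */
    static List<String> loadHeaderLines(Path dir, String name) {
        try {
            return Files.readAllLines(dir.resolve(name + ".headers"), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the body back exactly as it was written. */
    static byte[] loadBody(Path dir, String name) {
        try {
            return Files.readAllBytes(dir.resolve(name + ".body"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the body as bytes, rather than converting it to a String at fetch time, leaves the charset decision to the later step.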

Converting the above into a parsable page is something best handled by  
Tika (as an example), since it will attempt to determine the charset  
encoding for the bytes, based on the response header, the HTML markup,  
and (worst case) statistics from the bytes.
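
To make that lookup order concrete, here is a rough JDK-only sketch. It is not Tika's algorithm: the class name CharsetGuess, the regular expression, the 4096-byte sniffing window, and the ISO-8859-1 fallback are all assumptions made for illustration, and the statistical step is only marked, not implemented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the lookup order; this is not Tika's algorithm. */
final class CharsetGuess {

    // Matches charset=... in a Content-Type header value or in a <meta> tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    static Charset guess(String contentTypeHeader, byte[] body) {
        // 1. The response header, if it names a charset this JVM supports.
        Charset fromHeader = find(contentTypeHeader);
        if (fromHeader != null) {
            return fromHeader;
        }
        // 2. The markup: view the first bytes as ISO-8859-1 (every byte maps to
        //    exactly one char) and look for a charset declaration.
        int n = Math.min(body.length, 4096);
        Charset fromMarkup = find(new String(body, 0, n, StandardCharsets.ISO_8859_1));
        if (fromMarkup != null) {
            return fromMarkup;
        }
        // 3. Byte statistics would go here; this sketch falls back to a fixed default.
        return StandardCharsets.ISO_8859_1;
    }

    /** Returns the first supported charset named in the text, or null if none. */
    private static Charset find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        while (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: keep looking.
            }
        }
        return null;
    }
}
```

The regular expression is deliberately crude (it would also match "charset=" in ordinary page text), which is one reason a dedicated library is the better choice for real pages.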

The second, off-line step has nothing to do with HttpClient.
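
If the saved bytes are still content-encoded (the gzip/deflate case raised in the quoted message), that can also be undone off-line with the JDK alone. A minimal sketch, assuming the Content-Encoding header value was saved alongside the body; the class name ContentDecoder is hypothetical, and "deflate" is treated here as zlib-wrapped data, which not every server sends.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper: undoes the Content-Encoding of a stored body. */
final class ContentDecoder {

    static byte[] decode(String contentEncoding, byte[] stored) {
        String enc = contentEncoding == null
                ? "identity"
                : contentEncoding.trim().toLowerCase(Locale.ROOT);
        try {
            InputStream in = new ByteArrayInputStream(stored);
            if (enc.equals("gzip") || enc.equals("x-gzip")) {
                in = new GZIPInputStream(in);
            } else if (enc.equals("deflate")) {
                // Assumes zlib-wrapped data; raw deflate streams would need
                // new Inflater(true) passed to InflaterInputStream instead.
                in = new InflaterInputStream(in);
            } else if (!enc.equals("identity") && !enc.isEmpty()) {
                throw new IllegalArgumentException("Unsupported Content-Encoding: " + enc);
            }
            // Copy everything the (possibly wrapped) stream yields.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This only covers Content-Encoding. Chunked transfer framing is a separate layer, which is what the HttpCore snippet quoted below removes.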

-- Ken

On Feb 11, 2011, at 1:30am, CodingForever wrote:

>
> Much appreciated.
> That works the way I want.
> As you can see, I am trying to decode an HTML page off-line from its
> header and content. I am not an expert with HttpClient, so I could not
> find the best solution for my problem.
>
> Suppose you have a header and the raw content (gzip, deflate, possibly
> chunked). I need a solution where I supply the header and content and
> then read the decoded output until the end of the page.
> Can you offer me a solution for this problem?
>
> Best Regards.
>
> olegk wrote:
>>
>> On Fri, 2011-02-11 at 00:39 -0800, CodingForever wrote:
>>> Thanks olegk for the answer; I am looking at that now. But I have a
>>> question about the code I wrote below. How can I get the decoded
>>> content using the header parameters?
>>
>> // Raw response as captured: status line, headers, then a chunked body.
>> // The " test" after the final zero-size chunk is not part of the entity.
>> String s =
>>    "HTTP/1.1 200 OK\r\n"
>>    + "Server: whatever\r\n"
>>    + "Date: some date\r\n"
>>    + "Set-Cookie: c1=stuff\r\n"
>>    + "Transfer-Encoding: chunked\r\n"
>>    + "Content-Type: text/html; charset=ISO-8859-1\r\n"
>>    + "\r\n"
>>    + "5\r\n01234\r\n5\r\n56789\r\n6\r\nabcdef\r\n0\r\n\r\n test";
>> // SessionInputBufferMockup comes from the HttpCore test sources.
>> SessionInputBuffer inbuffer = new SessionInputBufferMockup(s, "US-ASCII");
>> // Parse the status line and headers.
>> HttpResponseParser parser = new HttpResponseParser(
>>    inbuffer,
>>    BasicLineParser.DEFAULT,
>>    new DefaultHttpResponseFactory(),
>>    new BasicHttpParams());
>> HttpResponse response = (HttpResponse) parser.parse();
>> // Build the entity from the same buffer; Transfer-Encoding: chunked is
>> // honoured, so the chunk framing is removed when the entity is read.
>> EntityDeserializer deserializer = new EntityDeserializer(
>>    new LaxContentLengthStrategy());
>> HttpEntity entity = deserializer.deserialize(inbuffer, response);
>> System.out.println(EntityUtils.toString(entity, HTTP.ASCII));
>>
>> ---
>>
>> Oleg
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
>> For additional commands, e-mail: httpclient-users-help@hc.apache.org
>>
>>
>>
>
> -- 
> View this message in context: http://old.nabble.com/Header-and-Content-parsing-and-saving-as-html-page-tp30897495p30899580.html
> Sent from the HttpClient-User mailing list archive at Nabble.com.
>

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





