hc-httpclient-users mailing list archives

From Khosro Asgharifard Sharabiani <khosro_quest...@yahoo.com>
Subject Re: Obtaining charset of page from HttpResponse.
Date Tue, 16 Aug 2011 15:36:16 GMT
Thanks Stijn,
I think your approach of using BufferedHttpEntity is useful: it avoids fetching the
content twice and still lets me find the charset of a webpage.
 
Khosro.


>________________________________
>From: Stijn Deknudt <stijn@ebisi.be>
>To: HttpClient User Discussion <httpclient-users@hc.apache.org>; Khosro Asgharifard Sharabiani <khosro_question@yahoo.com>
>Sent: Tuesday, August 16, 2011 5:57 PM
>Subject: Re: Obtaining charset of page from HttpResponse.
>
>I forgot to mention in my previous post that you can also use
>BufferedHttpEntity if you want to stream the content of the entity: in
>that case the content is likewise fetched only once.
>
>Kind regards,
>Stijn.
>
>On 8/16/11, Stijn Deknudt <stijn@ebisi.be> wrote:
>> Hi Khosro,
>>
>> As described in http://www.w3.org/International/O-charset, there are
>> different ways to declare the character encoding. Because the site you
>> mention doesn't provide the encoding in the header (see article:
>> Send the 'charset' parameter in the Content-Type header of HTTP),
>> you'll need to get the entity and find the encoding yourself in the
>> content.
>> One way to do this is to use EntityUtils to get the content, search
>> for the Content-Type meta tag, and use its charset to convert the
>> content. This means you don't use the stream directly (if you did,
>> you'd need to fetch the content twice: once to consume the content
>> until you had retrieved the character set information, and again to
>> consume the whole entity with that character set).
>>
>> Kind regards,
>> Stijn.
>>
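A minimal sketch of the approach described above, assuming the response body has already been read into a byte array (with HttpClient 4.x that could be done with EntityUtils.toByteArray(entity)). The class and method names are hypothetical, and the regex is a simplification, not a full HTML parser:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds a charset declared in a <meta> tag of buffered HTML.
class MetaCharsetSniffer {

    // Matches charset=... inside a <meta> tag, e.g.
    // <meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_:.\\-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in a meta tag, or the fallback if none is usable. */
    public static Charset sniff(byte[] body, Charset fallback) {
        // ISO-8859-1 maps every byte to one char, so the ASCII markup survives
        // intact whatever the real encoding is. Only the first 4 KB is scanned
        // (an arbitrary limit chosen for this sketch).
        int limit = Math.min(body.length, 4096);
        String head = new String(body, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: use the fallback instead.
            }
        }
        return fallback;
    }

    /** Decodes the already-buffered body, so the content is fetched only once. */
    public static String decode(byte[] body, Charset fallback) {
        return new String(body, sniff(body, fallback));
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head>"
                + "<body>caf\u00e9</body></html>").getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(page, StandardCharsets.ISO_8859_1));
        System.out.println(decode(page, StandardCharsets.ISO_8859_1));
    }
}
```

Because the bytes are held in memory, they can be scanned for the meta tag and then decoded without a second request. The fallback charset is the caller's choice; ISO-8859-1 is used here only as an example.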
>> On 8/16/11, Khosro Asgharifard Sharabiani <khosro_question@yahoo.com>
>> wrote:
>>> Hi Stijn,
>>> I also tried entity.getContentEncoding(), but it returns null.
>>> Is there any way to obtain the charset of a webpage?
>>> When we open this page in a browser like Firefox, it detects the charset,
>>> but when we request it with HttpClient or curl, we cannot get the charset.
>>> I think this is a big problem for a crawler: when we crawl a webpage,
>>> HttpClient gives us a stream, and we must know the charset of that
>>> webpage to save it in a database, but it seems that for some webpages
>>> we cannot get the charset.
>>>
>>> Khosro.
>>>
>>>
>>>>________________________________
>>>>From: Stijn Deknudt <stijn@ebisi.be>
>>>>To: HttpClient User Discussion <httpclient-users@hc.apache.org>
>>>>Cc: Khosro Asgharifard Sharabiani <khosro_question@yahoo.com>
>>>>Sent: Tuesday, August 16, 2011 4:38 PM
>>>>Subject: Re: Obtaining charset of page from HttpResponse.
>>>>
>>>>Hi Khosro,
>>>>
>>>>The Content-Type header is set (correctly) to "text/html", like Jon said.
>>>>There's no header in the response that says anything about the
>>>>character set, but you can obtain this information from the entity
>>>>itself: the HTML contains the character set inside the meta tag:
>>>><meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
>>>>
>>>>See also http://www.w3.org/International/O-charset to get more
>>>>information about all different possibilities to declare the character
>>>>encodings.
>>>>
>>>>Kind regards,
>>>>Stijn Deknudt.
>>>>
>>>>On 8/16/11, Jon Moore <jonm@apache.org> wrote:
>>>>> Hi,
>>>>>
>>>>> This is because the resource at www.annahar.com that you link to
>>>>> returns a Content-Type header that just reads "text/html":
>>>>>
>>>>> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" >/dev/null
>>>>> * About to connect() to www.annahar.com port 80 (#0)
>>>>> *   Trying 66.242.155.235... connected
>>>>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>>>>> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>>>>> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
>>>>> > Host: www.annahar.com
>>>>> > Accept: */*
>>>>> >
>>>>> < HTTP/1.1 200 OK
>>>>> < Connection: close
>>>>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>>>>> < Server: Microsoft-IIS/6.0
>>>>> < X-Powered-By: ASP.NET
>>>>> < X-Powered-By: PHP/5.2.0
>>>>> < Content-type: text/html
>>>>> <
>>>>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>>>>                                  Dload  Upload   Total   Spent    Left  Speed
>>>>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
>>>>> { [data not shown]
>>>>> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k
>>>>> * Closing connection #0
>>>>>
>>>>> So httpclient is doing the right thing -- it's giving you access to
>>>>> exactly what's in the header that's returned.
>>>>>
>>>>> Jon
>>>>>
>>>>>
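To illustrate why a bare "text/html" value yields no charset, here is a small sketch that pulls the charset parameter out of a Content-Type value. The class and method names are hypothetical, and the parsing is deliberately simple (it splits on semicolons and does not handle every quoting rule of the HTTP grammar):

```java
import java.util.Optional;

// Hypothetical helper: extracts the charset parameter from a Content-Type value.
class ContentTypeCharset {

    /** Returns the charset parameter if the header value carries one. */
    public static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Parameters follow the media type, separated by semicolons.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html"));                // Optional.empty
        System.out.println(charsetOf("text/html; charset=UTF-8")); // Optional[UTF-8]
    }
}
```

An empty result means the server did not declare a charset in the header, which is the case for the annahar.com response shown above; the encoding then has to come from the content itself.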
>>>>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>>>>> <khosro_question@yahoo.com> wrote:
>>>>>> Hello,
>>>>>> I use the following code to find the charset of a page, but it does not
>>>>>> work for the page
>>>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>>>
>>>>>> Code :
>>>>>>  [code]
>>>>>>
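>>>>>> import java.io.IOException;
>>>>>> import org.apache.http.Header;
>>>>>> import org.apache.http.HttpEntity;
>>>>>> import org.apache.http.HttpResponse;
>>>>>> import org.apache.http.client.ClientProtocolException;
>>>>>> import org.apache.http.client.HttpClient;
>>>>>> import org.apache.http.client.methods.HttpGet;
>>>>>> import org.apache.http.impl.client.DefaultHttpClient;
>>>>>>
>>>>>> public class CharsetCheck { // class name is illustrative
>>>>>>     public static void main(String[] args) {
>>>>>>         try {
>>>>>>             HttpClient httpclient = new DefaultHttpClient();
>>>>>>             String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>>>>             HttpGet httpget = new HttpGet(url);
>>>>>>             HttpResponse response = httpclient.execute(httpget);
>>>>>>             HttpEntity entity = response.getEntity();
>>>>>>             if (entity != null) {
>>>>>>                 // Print the raw Content-Type header; skip if the header is missing.
>>>>>>                 Header[] allHeaders = response.getHeaders("Content-Type");
>>>>>>                 if (allHeaders.length > 0) {
>>>>>>                     System.out.println(allHeaders[0].getValue());
>>>>>>                 }
>>>>>>             }
>>>>>>         } catch (ClientProtocolException e) {
>>>>>>             e.printStackTrace();
>>>>>>         } catch (IOException e) {
>>>>>>             e.printStackTrace();
>>>>>>         }
>>>>>>     }
>>>>>> }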
>>>>>> [/code]
>>>>>>
>>>>>>
>>>>>> The output of the above code is: text/html.
>>>>>> But I think the output should be "text/html; charset=windows-1256".
>>>>>> Am I right?
>>>>>>
>>>>>> But when I use
>>>>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>>>>> as the URL in the code, it returns "text/html; charset=UTF-8", which I
>>>>>> think is OK.
>>>>>> It seems to work for some pages but not all of them. Why does this happen?
>>>>>>
>>>>>>
>>>>>> Khosro.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
>>>>> For additional commands, e-mail: httpclient-users-help@hc.apache.org
>>>>>
>>>>>
>>>>
>>>>
>>>>--
>>>>Stijn
>>>>stijn@ebisi.be
>>>>
>>>>
>>>>
>>
>
>
>