hc-httpclient-users mailing list archives

From Stijn Deknudt <st...@ebisi.be>
Subject Re: Obtaining charset of page from HttpResponse.
Date Tue, 16 Aug 2011 13:16:47 GMT
Hi Khosro,

As described in http://www.w3.org/International/O-charset, there are
several ways to declare the character encoding of a page. Because the
site you mention doesn't send the encoding in the HTTP header (see the
article section "Send the 'charset' parameter in the Content-Type
header of HTTP"), you'll need to get the entity and find the encoding
yourself in the content.

One way to do this is to use EntityUtils to read the content into
memory, search it for the Content-Type meta tag, and then decode the
content with the charset named there. This means you don't work with
the stream directly. If you did, you would need to fetch the content
twice: once to read far enough to find the character set information,
and a second time to consume the whole entity with that character set.
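
A minimal sketch of this approach, using only the JDK so it does not depend on a particular HttpClient version. It assumes you already have the whole body as a byte array (for example from EntityUtils.toByteArray(entity)) and the raw Content-Type header value. The names CharsetSniffer, find and decode are hypothetical, and the ISO-8859-1 fallback and the 4096-byte search window are assumptions, not anything HttpClient prescribes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
final class CharsetSniffer {
    // Matches "charset=..." in a Content-Type header value or in a meta tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first charset named in the text, or null if none is
    // named or the JVM does not support it.
    static Charset find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        if (m.find()) {
            try {
                if (Charset.isSupported(m.group(1))) {
                    return Charset.forName(m.group(1));
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: treat as not declared.
            }
        }
        return null;
    }

    // Decodes the body using the header charset if present, otherwise the
    // charset declared in the markup, otherwise ISO-8859-1 (an assumption).
    static String decode(byte[] body, String contentTypeHeader) {
        Charset cs = find(contentTypeHeader);
        if (cs == null) {
            // ISO-8859-1 maps every byte to one char, so the ASCII markup
            // around the declaration is readable whatever the real encoding is.
            int n = Math.min(body.length, 4096);
            cs = find(new String(body, 0, n, StandardCharsets.ISO_8859_1));
        }
        if (cs == null) {
            cs = StandardCharsets.ISO_8859_1;
        }
        return new String(body, cs);
    }

    public static void main(String[] args) {
        byte[] body = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head>"
                + "<body>caf\u00e9</body></html>").getBytes(StandardCharsets.UTF_8);
        // The header carries no charset, so the meta tag decides.
        System.out.println(decode(body, "text/html"));
    }
}
```

Because the body is held in memory as bytes, it is fetched only once and can be decoded after the charset is known. A crawler would probably want a smarter fallback than ISO-8859-1 when neither the header nor the markup declares anything.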

Kind regards,
Stijn.

On 8/16/11, Khosro Asgharifard Sharabiani <khosro_question@yahoo.com> wrote:
> Hi Stijn,
> I also tried entity.getContentEncoding(), but it returns null.
> Is there any way to obtain the charset of a web page?
> When we open this page in a browser such as Firefox, it detects the charset,
> but when we request it with HttpClient or curl, we cannot get the charset.
> I think this is a big problem for a crawler. When we crawl
> a web page, HttpClient gives us a stream, and we must know the charset of
> that page to save it in the database, but for some pages it seems we cannot
> get the charset.
>
> Khosro.
>
>
>>________________________________
>>From: Stijn Deknudt <stijn@ebisi.be>
>>To: HttpClient User Discussion <httpclient-users@hc.apache.org>
>>Cc: Khosro Asgharifard Sharabiani <khosro_question@yahoo.com>
>>Sent: Tuesday, August 16, 2011 4:38 PM
>>Subject: Re: Obtaining charset of page from HttpResponse.
>>
>>Hi Khosro,
>>
>>The Content-Type header is set (correctly) to "text/html", like Jon said.
>>There's no header in the response that says anything about the
>>character set, but you can obtain this information from the entity
>>itself: the HTML contains the character set inside the meta tag:
>><meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
>>
>>See also http://www.w3.org/International/O-charset for more
>>information about the different ways to declare the character
>>encoding.
>>
>>Kind regards,
>>Stijn Deknudt.
>>
>>On 8/16/11, Jon Moore <jonm@apache.org> wrote:
>>> Hi,
>>>
>>> This is because the resource at www.annahar.com that you link to
>>> returns a Content-Type header that just reads "text/html":
>>>
>>> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" >/dev/null
>>> * About to connect() to www.annahar.com port 80 (#0)
>>> *   Trying 66.242.155.235... connected
>>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>>> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>>> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
>>> > Host: www.annahar.com
>>> > Accept: */*
>>> >
>>> < HTTP/1.1 200 OK
>>> < Connection: close
>>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>>> < Server: Microsoft-IIS/6.0
>>> < X-Powered-By: ASP.NET
>>> < X-Powered-By: PHP/5.2.0
>>> < Content-type: text/html
>>> <
>>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>>                                  Dload  Upload   Total   Spent    Left  Speed
>>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
>>> { [data not shown]
>>> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k
>>> * Closing connection #0
>>>
>>> So httpclient is doing the right thing -- it's giving you access to
>>> exactly what's in the header that's returned.
>>>
>>> Jon
>>>
>>>
>>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>>> <khosro_question@yahoo.com> wrote:
>>>> Hello,
>>>> I use the following code to find the charset of a page, but it does not
>>>> work for the page
>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>
>>>> Code :
>>>>  [code]
>>>>
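>>>> import java.io.IOException;
>>>> import org.apache.http.Header;
>>>> import org.apache.http.HttpEntity;
>>>> import org.apache.http.HttpResponse;
>>>> import org.apache.http.client.ClientProtocolException;
>>>> import org.apache.http.client.HttpClient;
>>>> import org.apache.http.client.methods.HttpGet;
>>>> import org.apache.http.impl.client.DefaultHttpClient;
>>>>
>>>> public class ContentTypeCheck {
>>>>     public static void main(String[] args) {
>>>>         try {
>>>>             HttpClient httpclient = new DefaultHttpClient();
>>>>             String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>>             HttpGet httpget = new HttpGet(url);
>>>>             HttpResponse response = httpclient.execute(httpget);
>>>>             HttpEntity entity = response.getEntity();
>>>>             // Print the Content-Type header exactly as the server sent it.
>>>>             Header[] allHeaders = response.getHeaders("Content-Type");
>>>>             if (entity != null && allHeaders.length > 0) {
>>>>                 System.out.println(allHeaders[0].getValue());
>>>>             }
>>>>         } catch (ClientProtocolException e) {
>>>>             e.printStackTrace();
>>>>         } catch (IOException e) {
>>>>             e.printStackTrace();
>>>>         }
>>>>     }
>>>> }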
>>>> [/code]
>>>>
>>>>
>>>> The output of the above code is: text/html.
>>>> But I think the output should be "text/html; charset=windows-1256". Am I
>>>> right?
>>>>
>>>> But when I use
>>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>>> as the URL in the code, it returns "text/html; charset=UTF-8", which I
>>>> think is correct.
>>>> It seems to work for some pages but not all of them. Why does this happen?
>>>>
>>>>
>>>> Khosro.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
>>> For additional commands, e-mail: httpclient-users-help@hc.apache.org
>>>
>>>
>>
>>
>>--
>>Stijn
>>stijn@ebisi.be
>>
>>
>>


