hc-httpclient-users mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Obtaining charset of page from HttpResponse.
Date Tue, 16 Aug 2011 13:57:01 GMT
Hi Khosro,

Detecting the charset for an arbitrary HTML page is a non-trivial problem, and not something
that is in scope for HttpClient.

E.g. sometimes the response header has no charset, and there's nothing in the HTML <meta>
tag.
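
For the cases where a charset is declared, the lookup order discussed in this thread
(Content-Type header first, then the HTML <meta> tag) can be sketched in plain Java roughly
as follows. This is only an illustration: the names DeclaredCharset, fromContentType,
fromMeta and declared are made up for this example and are not part of HttpClient, the
4096-byte scan window is an arbitrary assumption, and the regex is much looser than a real
HTML parser. It returns null when neither source names a charset, which is the case
described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
final class DeclaredCharset {
    // Matches charset=NAME, optionally quoted, in a header value or a <meta> tag.
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    // Matches a whole <meta ...> tag.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the charset name from a Content-Type value, or null if none is given. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Scans the start of the body for a <meta> tag that declares a charset. */
    static String fromMeta(byte[] body) {
        // Assumption: the declaration sits in the first 4096 bytes.
        int len = Math.min(body.length, 4096);
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives intact.
        // This does not work for UTF-16 or other non-ASCII-compatible encodings.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            String name = fromContentType(tag.group());
            if (name != null) {
                return name;
            }
        }
        return null;
    }

    /** Header first, then <meta>; null if the page declares nothing. */
    static String declared(String contentType, byte[] body) {
        String name = fromContentType(contentType);
        return name != null ? name : fromMeta(body);
    }
}
```

With the header value "text/html" and a body containing the <meta> tag quoted further down
in this thread, declared(...) would return "windows-1256". The returned name is still
untrusted input, so a caller should check it with Charset.isSupported before using it.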

In that case, browsers (and web crawlers) use statistical analysis to guess at the appropriate
charset.
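
A real statistical detector is too long to show here, but a much cruder first check can be
sketched with the JDK alone: if the bytes decode as strict UTF-8, treat them as UTF-8,
otherwise fall back to a default the caller supplies. To be clear, this is a validity test
and not the statistical analysis that browsers perform, and the names CharsetGuess,
isValidUtf8 and guess are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a validity check, not a statistical detector.
final class CharsetGuess {
    /** Returns true if the bytes form a complete, well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                // Throw on bad input instead of silently substituting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** UTF-8 if the bytes are valid UTF-8, otherwise the caller's fallback. */
    static Charset guess(byte[] bytes, Charset fallback) {
        return isValidUtf8(bytes) ? StandardCharsets.UTF_8 : fallback;
    }
}
```

A crawler might call guess(bodyBytes, someDefault) only after the header and the <meta>
tag have both come up empty. The check needs the whole body, or a prefix cut on a character
boundary, because a prefix that ends in the middle of a multi-byte sequence is reported as
malformed. Pure ASCII input also passes, which is harmless since ASCII is a subset of UTF-8.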

One suggestion - you can use Tika to process a web page and detect the charset.

-- Ken

On Aug 16, 2011, at 6:07am, Jon Moore wrote:

> Hi Khosro,
> 
> Stijn is saying that you need to parse the text/html response body and
> look for the <meta> tag that contains the charset. There are multiple
> places the charset for an HTML webpage can be specified: please see
> the link that Stijn sent for more details.
> 
> Jon
> 
> On Tue, Aug 16, 2011 at 8:40 AM, Khosro Asgharifard Sharabiani
> <khosro_question@yahoo.com> wrote:
>> Hi Stijn :
>> I also use entity.getContentEncoding(), but it returns "null".
>> Is there any way to obtain the charset of a webpage?
>> When we browse this page in a browser like FF, it detects the charset, but when we request
>> it with HttpClient or curl, we cannot get the charset.
>> I think this is a big problem when we have a crawler, because when we crawl a webpage,
>> HttpClient gives us a stream, and we must know the charset of that webpage to save it in
>> a database, but it seems that for some webpages we cannot get the charset.
>> 
>> Khosro.
>> 
>> 
>>> ________________________________
>>> From: Stijn Deknudt <stijn@ebisi.be>
>>> To: HttpClient User Discussion <httpclient-users@hc.apache.org>
>>> Cc: Khosro Asgharifard Sharabiani <khosro_question@yahoo.com>
>>> Sent: Tuesday, August 16, 2011 4:38 PM
>>> Subject: Re: Obtaining charset of page from HttpResponse.
>>> 
>>> Hi Khosro,
>>> 
>>> The Content-Type header is set (correctly) to "text/html", like Jon said.
>>> There's no header in the response that says anything about the
>>> character set, but you can obtain this information from the entity
>>> itself: the HTML contains the character set inside the meta tag:
>>> <meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
>>> 
>>> See also http://www.w3.org/International/O-charset to get more
>>> information about all different possibilities to declare the character
>>> encodings.
>>> 
>>> Kind regards,
>>> Stijn Deknudt.
>>> 
>>> On 8/16/11, Jon Moore <jonm@apache.org> wrote:
>>>> Hi,
>>>> 
>>>> This is because the resource at www.annahar.com that you link to
>>>> returns a Content-Type header that just reads "text/html":
>>>> 
>>>> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" > /dev/null
>>>> * About to connect() to www.annahar.com port 80 (#0)
>>>> *   Trying 66.242.155.235... connected
>>>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>>>> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>>>> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
>>>> > Host: www.annahar.com
>>>> > Accept: */*
>>>> >
>>>> < HTTP/1.1 200 OK
>>>> < Connection: close
>>>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>>>> < Server: Microsoft-IIS/6.0
>>>> < X-Powered-By: ASP.NET
>>>> < X-Powered-By: PHP/5.2.0
>>>> < Content-type: text/html
>>>> <
>>>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>>>                                  Dload  Upload   Total   Spent    Left  Speed
>>>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
>>>> { [data not shown]
>>>> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k
>>>> * Closing connection #0
>>>> 
>>>> So httpclient is doing the right thing -- it's giving you access to
>>>> exactly what's in the header that's returned.
>>>> 
>>>> Jon
>>>> 
>>>> 
>>>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>>>> <khosro_question@yahoo.com> wrote:
>>>>> Hello,
>>>>> I use the following code to find the charset of a page, but it does not work
>>>>> for the page
>>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>> 
>>>>> Code :
>>>>>  [code]
>>>>> 
>>>>> try {
>>>>>     HttpClient httpclient = new DefaultHttpClient();
>>>>>     String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>>>     HttpGet httpget = new HttpGet(url);
>>>>>     HttpResponse response;
>>>>>     response = httpclient.execute(httpget);
>>>>>     HttpEntity entity = response.getEntity();
>>>>>     if (entity != null) {
>>>>>         Header[] allHeaders = response.getHeaders("Content-Type");
>>>>>         System.out.println(allHeaders[0].getValue());
>>>>>     }
>>>>> } catch (ClientProtocolException e) {
>>>>>     e.printStackTrace();
>>>>> } catch (IOException e) {
>>>>>     e.printStackTrace();
>>>>> }
>>>>> [/code]
>>>>> 
>>>>> 
>>>>> The output of the above code is: text/html.
>>>>> But I think the output should be "text/html; charset=windows-1256". Am I
>>>>> right?
>>>>>
>>>>> But when I use
>>>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>>>> as the URL in the code, it returns "text/html; charset=UTF-8", which I think
>>>>> is OK.
>>>>> It seems to work for some pages but not all of them. Why does this happen?
>>>>> 
>>>>> 
>>>>> Khosro.
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
>>>> For additional commands, e-mail: httpclient-users-help@hc.apache.org
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Stijn
>>> stijn@ebisi.be
>>> 
>>> 
>>> 
> 

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions






