hc-httpclient-users mailing list archives

From Stijn Deknudt <st...@ebisi.be>
Subject Re: Obtaining charset of page from HttpResponse.
Date Tue, 16 Aug 2011 12:08:06 GMT
Hi Khosro,

The Content-Type header is (correctly) set to "text/html", as Jon said.
No header in the response says anything about the character set, but
you can get this information from the entity itself: the HTML declares
the character set in a meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1256">

See also http://www.w3.org/International/O-charset for more
information about the different ways to declare a character encoding.
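
A rough sketch of that lookup, using only the JDK so it does not depend
on any particular HttpClient version. The class and method names
(CharsetSniffer, fromContentType, fromMeta, detect) are made up for this
example, and the regular expressions are a simplification, not a full
HTML parser. It assumes you already have the Content-Type header value
and the raw body bytes (for instance from EntityUtils.toByteArray(entity)):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
class CharsetSniffer {
    // Matches "charset=value" in a header value or inside a meta tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][\\w.:+-]*)",
            Pattern.CASE_INSENSITIVE);
    // Matches a single <meta ...> tag.
    private static final Pattern META =
            Pattern.compile("<meta[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the charset parameter of a Content-Type value, or null.
    static String fromContentType(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(headerValue);
        return m.find() ? m.group(1) : null;
    }

    // Scans the first 4096 bytes of the body for a meta tag that
    // declares a charset. ISO-8859-1 maps every byte to one char, so
    // the ASCII markup can be read before the real encoding is known.
    static String fromMeta(byte[] body) {
        int len = Math.min(body.length, 4096);
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher meta = META.matcher(head);
        while (meta.find()) {
            String cs = fromContentType(meta.group());
            if (cs != null) {
                return cs;
            }
        }
        return null;
    }

    // Header first, then meta tag, then the caller's fallback.
    static String detect(String headerValue, byte[] body, String fallback) {
        String cs = fromContentType(headerValue);
        if (cs == null) {
            cs = fromMeta(body);
        }
        return (cs != null && Charset.isSupported(cs)) ? cs : fallback;
    }
}
```

With the header value from allHeaders[0].getValue() and the body bytes,
something like new String(body, CharsetSniffer.detect(value, body,
"ISO-8859-1")) would then decode the page. The header is checked first
because it normally takes precedence over an in-document declaration.
HttpClient 4.x also has EntityUtils.getContentCharSet(entity), which
reads the same header parameter, so it would not help for this page.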

Kind regards,
Stijn Deknudt.

On 8/16/11, Jon Moore <jonm@apache.org> wrote:
> Hi,
>
> This is because the resource at www.annahar.com that you link to
> returns a Content-Type header that just reads "text/html":
>
> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" >/dev/null
> * About to connect() to www.annahar.com port 80 (#0)
> *   Trying 66.242.155.235... connected
> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
> > Host: www.annahar.com
> > Accept: */*
> >
> < HTTP/1.1 200 OK
> < Connection: close
> < Date: Tue, 16 Aug 2011 11:50:50 GMT
> < Server: Microsoft-IIS/6.0
> < X-Powered-By: ASP.NET
> < X-Powered-By: PHP/5.2.0
> < Content-type: text/html
> <
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
> { [data not shown]
> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k
> * Closing connection #0
>
> So httpclient is doing the right thing -- it's giving you access to
> exactly what's in the header that's returned.
>
> Jon
>
>
> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
> <khosro_question@yahoo.com> wrote:
>> Hello,
>> I use the following code to find the charset of a page, but it does not
>> work for the page
>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>
>> Code :
>>  [code]
>>
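>> import java.io.IOException;
>> import org.apache.http.Header;
>> import org.apache.http.HttpEntity;
>> import org.apache.http.HttpResponse;
>> import org.apache.http.client.ClientProtocolException;
>> import org.apache.http.client.HttpClient;
>> import org.apache.http.client.methods.HttpGet;
>> import org.apache.http.impl.client.DefaultHttpClient;
>>
>> try {
>>     HttpClient httpclient = new DefaultHttpClient();
>>     String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>     HttpGet httpget = new HttpGet(url);
>>     HttpResponse response = httpclient.execute(httpget);
>>     HttpEntity entity = response.getEntity();
>>     if (entity != null) {
>>         // Print the raw Content-Type header value exactly as the server sent it.
>>         Header[] allHeaders = response.getHeaders("Content-Type");
>>         if (allHeaders.length > 0) {
>>             System.out.println(allHeaders[0].getValue());
>>         }
>>     }
>> } catch (ClientProtocolException e) {
>>     e.printStackTrace();
>> } catch (IOException e) {
>>     e.printStackTrace();
>> }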
>> [/code]
>>
>>
>> The output of the above code is: text/html.
>> But I think the output should be "text/html; charset=windows-1256". Am I
>> right?
>>
>> But when I use
>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>> as the URL in the code, it returns "text/html; charset=UTF-8", which I
>> think is OK.
>> It seems to work for some pages but not all of them. Why does this happen?
>>
>>
>> Khosro.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
> For additional commands, e-mail: httpclient-users-help@hc.apache.org
>
>


-- 
Stijn
stijn@ebisi.be


