hc-httpclient-users mailing list archives

From Khosro Asgharifard Sharabiani <khosro_quest...@yahoo.com>
Subject Re: Obtaining charset of page from HttpResponse.
Date Tue, 16 Aug 2011 15:41:01 GMT
Hi Ken,
Tika may well be a good option, but I have not used it, so I need to look into your approach further.
For now, I think Stijn's suggestion of using BufferedHttpEntity is good enough.
 
Khosro.
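
A minimal sketch of the lookup order discussed in this thread: try the charset in the Content-Type header first, then fall back to the <meta> tag in the body. It uses only the JDK. The class name CharsetSniffer, its method names, the 4096-byte scan window, and the caller-supplied fallback are all assumptions made for illustration, not part of HttpClient. With HttpClient 4.x the body bytes would presumably come from something like EntityUtils.toByteArray(entity). It does not attempt the statistical detection Ken mentions.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of HttpClient. */
final class CharsetSniffer {

    // Matches charset=... in a Content-Type value or inside a <meta> tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Matches one complete <meta ...> tag.
    private static final Pattern META_TAG = Pattern.compile(
            "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if none is given. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Scans the start of the body for a <meta> charset declaration, or returns null. */
    static String fromMetaTag(byte[] body) {
        // Assumption: the declaration appears within the first 4096 bytes.
        int len = Math.min(body.length, 4096);
        // ISO-8859-1 maps each byte to one char, so ASCII markup is readable
        // regardless of the page's real encoding.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            Matcher m = CHARSET.matcher(tag.group());
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    /** Header first, then <meta>, then the caller's fallback. */
    static String detect(String contentType, byte[] body, String fallback) {
        String charset = fromContentType(contentType);
        if (charset == null) {
            charset = fromMetaTag(body);
        }
        return charset != null ? charset : fallback;
    }

    public static void main(String[] args) {
        byte[] html = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=windows-1256\"></head></html>")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect("text/html", html, "ISO-8859-1"));                // windows-1256
        System.out.println(detect("text/html; charset=UTF-8", html, "ISO-8859-1")); // UTF-8
    }
}
```

The body has to be fully buffered before it is decoded, because the <meta> tag is only visible after reading part of the stream; that is where a buffered entity helps. If neither source names a charset, the fallback is only a guess, and a detector such as Tika would be the next step.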


>________________________________
>From: Ken Krugler <kkrugler_lists@transpac.com>
>To: HttpClient User Discussion <httpclient-users@hc.apache.org>
>Sent: Tuesday, August 16, 2011 6:27 PM
>Subject: Re: Obtaining charset of page from HttpResponse.
>
>Hi Khosro,
>
>Detecting the charset for an arbitrary HTML page is a non-trivial problem, and not something that is in scope for HttpClient.
>
>E.g. sometimes the response header has no charset, and there's nothing in the HTML <meta> tag.
>
>In that case, browsers (and web crawlers) use statistical analysis to guess at the appropriate charset.
>
>One suggestion - you can use Tika to process a web page and detect the charset.
>
>-- Ken
>
>On Aug 16, 2011, at 6:07am, Jon Moore wrote:
>
>> Hi Khosro,
>> 
>> Stijn is saying that you need to parse the text/html response body and
>> look for the <meta> tag that contains the charset. There are multiple
>> places the charset for an HTML webpage can be specified: please see
>> the link that Stijn sent for more details.
>> 
>> Jon
>> 
>> On Tue, Aug 16, 2011 at 8:40 AM, Khosro Asgharifard Sharabiani
>> <khosro_question@yahoo.com> wrote:
>>> Hi Stijn,
>>> I also tried entity.getContentEncoding(), but it returns null.
>>> Is there any way to obtain the charset of a web page?
>>> When we open this page in a browser such as Firefox, it detects the charset, but when we request it with HttpClient or curl, we cannot get the charset.
>>> I think this is a big problem for a crawler: HttpClient gives us a stream, and we must know the charset of the page to save it in a database, but for some pages it seems we cannot obtain it.
>>> 
>>> Khosro.
>>> 
>>> 
>>>> ________________________________
>>>> From: Stijn Deknudt <stijn@ebisi.be>
>>>> To: HttpClient User Discussion <httpclient-users@hc.apache.org>
>>>> Cc: Khosro Asgharifard Sharabiani <khosro_question@yahoo.com>
>>>> Sent: Tuesday, August 16, 2011 4:38 PM
>>>> Subject: Re: Obtaining charset of page from HttpResponse.
>>>> 
>>>> Hi Khosro,
>>>> 
>>>> The Content-Type header is set (correctly) to "text/html", like Jon said.
>>>> There's no header in the response that says anything about the
>>>> character set, but you can obtain this information from the entity
>>>> itself: the HTML contains the character set inside the meta tag:
>>>> <meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
>>>> 
>>>> See also http://www.w3.org/International/O-charset for more
>>>> information about the different ways to declare the character
>>>> encoding.
>>>> 
>>>> Kind regards,
>>>> Stijn Deknudt.
>>>> 
>>>> On 8/16/11, Jon Moore <jonm@apache.org> wrote:
>>>>> Hi,
>>>>> 
>>>>> This is because the resource at www.annahar.com that you link to
>>>>> returns a Content-Type header that just reads "text/html":
>>>>> 
>>>>> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" > /dev/null
>>>>> * About to connect() to www.annahar.com port 80 (#0)
>>>>> *   Trying 66.242.155.235... connected
>>>>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>>>>> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>>>>> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
>>>>> > Host: www.annahar.com
>>>>> > Accept: */*
>>>>> >
>>>>> < HTTP/1.1 200 OK
>>>>> < Connection: close
>>>>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>>>>> < Server: Microsoft-IIS/6.0
>>>>> < X-Powered-By: ASP.NET
>>>>> < X-Powered-By: PHP/5.2.0
>>>>> < Content-type: text/html
>>>>> <
>>>>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>>>>                                  Dload  Upload   Total   Spent    Left  Speed
>>>>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{ [data not shown]
>>>>> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k* Closing connection #0
>>>>> 
>>>>> So httpclient is doing the right thing -- it's giving you access to
>>>>> exactly what's in the header that's returned.
>>>>> 
>>>>> Jon
>>>>> 
>>>>> 
>>>>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>>>>> <khosro_question@yahoo.com> wrote:
>>>>>> Hello,
>>>>>> I use the following code to find the charset of a page, but it does not work
>>>>>> for the page
>>>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>>> 
>>>>>> Code :
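>>>>>>  [code]
>>>>>> 
>>>>>> // Imports this fragment needs (HttpClient 4.x)
>>>>>> import java.io.IOException;
>>>>>> import org.apache.http.Header;
>>>>>> import org.apache.http.HttpEntity;
>>>>>> import org.apache.http.HttpResponse;
>>>>>> import org.apache.http.client.ClientProtocolException;
>>>>>> import org.apache.http.client.HttpClient;
>>>>>> import org.apache.http.client.methods.HttpGet;
>>>>>> import org.apache.http.impl.client.DefaultHttpClient;
>>>>>> 
>>>>>> try {
>>>>>>     HttpClient httpclient = new DefaultHttpClient();
>>>>>>     String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>>>>     HttpGet httpget = new HttpGet(url);
>>>>>>     HttpResponse response = httpclient.execute(httpget);
>>>>>>     HttpEntity entity = response.getEntity();
>>>>>>     if (entity != null) {
>>>>>>         // Print the raw Content-Type header value sent by the server
>>>>>>         Header[] allHeaders = response.getHeaders("Content-Type");
>>>>>>         if (allHeaders.length > 0) {
>>>>>>             System.out.println(allHeaders[0].getValue());
>>>>>>         }
>>>>>>     }
>>>>>> } catch (ClientProtocolException e) {
>>>>>>     e.printStackTrace();
>>>>>> } catch (IOException e) {
>>>>>>     e.printStackTrace();
>>>>>> }
>>>>>> [/code]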
>>>>>> 
>>>>>> 
>>>>>> The output of the above code is: text/html.
>>>>>> But I think the output should be "text/html; charset=windows-1256". Am I right?
>>>>>> 
>>>>>> When I use
>>>>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>>>>> as the URL in the code, it returns "text/html; charset=UTF-8", which I think is correct.
>>>>>> It seems to work for some pages but not all of them. Why does this happen?
>>>>>> 
>>>>>> 
>>>>>> Khosro.
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
>>>>> For additional commands, e-mail: httpclient-users-help@hc.apache.org
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Stijn
>>>> stijn@ebisi.be
>>>> 
>>>> 
>>>> 
>
>--------------------------
>Ken Krugler
>+1 530-210-6378
>http://bixolabs.com
>custom data mining solutions