Subject: Re: Obtaining charset of page from HttpResponse.
From: Ken Krugler
Date: Tue, 16 Aug 2011 06:57:01 -0700
To: "HttpClient User Discussion"

Hi Khosro,

Detecting the charset for an arbitrary HTML page is a non-trivial
problem, and not something that is in scope for HttpClient.

E.g. sometimes the response header has no charset, and there's nothing
in the HTML meta tag.

In that case, browsers (and web crawlers) use statistical analysis to
guess at the appropriate charset.

One suggestion: you can use Tika to process a web page and detect the
charset.

-- Ken

On Aug 16, 2011, at 6:07am, Jon Moore wrote:

> Hi Khosro,
>
> Stijn is saying that you need to parse the text/html response body and
> look for the meta tag that contains the charset. There are multiple
> places the charset for an HTML webpage can be specified: please see
> the link that Stijn sent for more details.
>
> Jon
>
> On Tue, Aug 16, 2011 at 8:40 AM, Khosro Asgharifard Sharabiani
> wrote:
>> Hi Stijn,
>> I also use entity.getContentEncoding(), but it returns "null".
>> Is there any way to obtain the charset of a webpage?
>> When we open this page in a browser like Firefox, it determines the
>> charset, but when we request it with HttpClient or curl, we cannot
>> get the charset.
>> I think this is a big problem for a crawler: when we crawl a webpage,
>> HttpClient gives us a stream, and we must know the charset of that
>> webpage to save it in a database, but for some webpages we cannot
>> get the charset.
>>
>> Khosro.
>>
>>
>>> ________________________________
>>> From: Stijn Deknudt
>>> To: HttpClient User Discussion
>>> Cc: Khosro Asgharifard Sharabiani
>>> Sent: Tuesday, August 16, 2011 4:38 PM
>>> Subject: Re: Obtaining charset of page from HttpResponse.
>>>
>>> Hi Khosro,
>>>
>>> The Content-Type header is set (correctly) to "text/html", like Jon said.
>>> There's no header in the response that says anything about the
>>> character set, but you can obtain this information from the entity
>>> itself: the HTML contains the character set inside the meta tag:
>>>
>>>
>>> See also http://www.w3.org/International/O-charset to get more
>>> information about all the different ways to declare the character
>>> encoding.
>>>
>>> Kind regards,
>>> Stijn Deknudt.
>>>
>>> On 8/16/11, Jon Moore wrote:
>>>> Hi,
>>>>
>>>> This is because the resource at www.annahar.com that you link to
>>>> returns a Content-Type header that just reads "text/html":
>>>>
>>>> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" > /dev/null
>>>> * About to connect() to www.annahar.com port 80 (#0)
>>>> *   Trying 66.242.155.235... connected
>>>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>>>> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>>>> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
>>>> > Host: www.annahar.com
>>>> > Accept: */*
>>>> >
>>>> < HTTP/1.1 200 OK
>>>> < Connection: close
>>>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>>>> < Server: Microsoft-IIS/6.0
>>>> < X-Powered-By: ASP.NET
>>>> < X-Powered-By: PHP/5.2.0
>>>> < Content-type: text/html
>>>> <
>>>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>>>                                  Dload  Upload   Total   Spent    Left  Speed
>>>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{ [data not shown]
>>>> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k* Closing connection #0
>>>>
>>>> So httpclient is doing the right thing -- it's giving you access to
>>>> exactly what's in the header that's returned.
>>>>
>>>> Jon
>>>>
>>>>
>>>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>>>> wrote:
>>>>> Hello,
>>>>> I use the following code to find the charset of a page, but it does
>>>>> not work for the page
>>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>>
>>>>> Code:
>>>>> [code]
>>>>>
>>>>> try {
>>>>>     HttpClient httpclient = new DefaultHttpClient();
>>>>>     String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>>>     HttpGet httpget = new HttpGet(url);
>>>>>     HttpResponse response;
>>>>>     response = httpclient.execute(httpget);
>>>>>     HttpEntity entity = response.getEntity();
>>>>>     if (entity != null) {
>>>>>         // Print the raw Content-Type header value sent by the server.
>>>>>         Header[] allHeaders = response.getHeaders("Content-Type");
>>>>>         System.out.println(allHeaders[0].getValue());
>>>>>     }
>>>>> } catch (ClientProtocolException e) {
>>>>>     e.printStackTrace();
>>>>> } catch (IOException e) {
>>>>>     e.printStackTrace();
>>>>> }
>>>>> [/code]
>>>>>
>>>>>
>>>>> And the output of the above code is: text/html.
>>>>> But I think the output should be "text/html; charset=windows-1256". Am I
>>>>> right?
>>>>>
>>>>> But when I use
>>>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>>>> as the URL in the code, it returns "text/html; charset=UTF-8", which I
>>>>> think is OK.
>>>>> It seems to work for some pages but not all of them. Why does this
>>>>> happen?
>>>>>
>>>>>
>>>>> Khosro.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
>>>> For additional commands, e-mail: httpclient-users-help@hc.apache.org
>>>>
>>>>
>>>
>>>
>>> --
>>> Stijn
>>> stijn@ebisi.be
>>>
>>>
>>>
>

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions
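
For illustration, here is a minimal sketch of the fallback order discussed in this thread: first look for a charset parameter in the Content-Type header value, then scan the start of the HTML body for a meta tag that declares one. It uses only the Java standard library. The class name CharsetSniffer, its method names, the 4096-byte scan limit, and the sample HTML in the usage note are all hypothetical and are not part of HttpClient or Tika. It does not attempt the statistical detection mentioned above, and a regular expression is only a rough stand-in for a real HTML parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an HttpClient or Tika class.
final class CharsetSniffer {
    // Matches "charset=NAME", with optional whitespace and an optional opening quote.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    // Matches a whole <meta ...> tag (assumes no '>' inside attribute values).
    private static final Pattern META = Pattern.compile(
            "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or null if there is none.
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Scans the first bytes of the body for a meta tag that declares a charset.
    // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding
    // even when the real encoding is still unknown.
    static String fromMetaTag(byte[] body) {
        int len = Math.min(body.length, 4096); // assumed limit for this sketch
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher tags = META.matcher(head);
        while (tags.find()) {
            String charset = fromContentType(tags.group());
            if (charset != null) {
                return charset;
            }
        }
        return null;
    }

    // Header first, then meta tag, then the caller-supplied fallback.
    static String detect(String contentType, byte[] body, String fallback) {
        String charset = fromContentType(contentType);
        if (charset == null) {
            charset = fromMetaTag(body);
        }
        return charset != null ? charset : fallback;
    }
}
```

With the code quoted in this thread, the header value could come from response.getFirstHeader("Content-Type") and the bytes from EntityUtils.toByteArray(entity). A call such as CharsetSniffer.detect("text/html", bytes, "ISO-8859-1") would then return the fallback only when neither the header nor a meta tag names a charset. If that happens often in a crawl, a detector such as Tika is probably the better tool.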