From: Khosro Asgharifard Sharabiani
To: HttpClient User Discussion
Subject: Re: Obtaining charset of page from HttpResponse.
Date: Tue, 16 Aug 2011 08:41:01 -0700 (PDT)
Message-ID: <1313509261.61935.YahooMailNeo@web161715.mail.bf1.yahoo.com>
In-Reply-To: <605A52AF-DA09-47DA-A542-5C4F75CC768F@transpac.com>

Hi Ken,
Tika may well work, but I have not used it, so I need to look into your approach further.
Anyway, I think Stijn's approach of using BufferedHttpEntity is useful for now.

Khosro.


>________________________________
>From: Ken Krugler
>To: HttpClient User Discussion
>Sent: Tuesday, August 16, 2011 6:27 PM
>Subject: Re: Obtaining charset of page from HttpResponse.
>
>Hi Khosro,
>
>Detecting the charset for an arbitrary HTML page is a non-trivial problem, and not something that is in scope for HttpClient.
>
>E.g. sometimes the response header has no charset, and there's nothing in the HTML <meta> tag.
>
>In that case, browsers (and web crawlers) use statistical analysis to guess at the appropriate charset.
>
>One suggestion - you can use Tika to process a web page and detect the charset.
>
>-- Ken
>
>On Aug 16, 2011, at 6:07am, Jon Moore wrote:
>
>> Hi Khosro,
>>
>> Stijn is saying that you need to parse the text/html response body and
>> look for the <meta> tag that contains the charset. There are multiple
>> places the charset for an HTML webpage can be specified: please see
>> the link that Stijn sent for more details.
>>
>> Jon
>>
>> On Tue, Aug 16, 2011 at 8:40 AM, Khosro Asgharifard Sharabiani wrote:
>>> Hi Stijn:
>>> I also use entity.getContentEncoding(), but it returns "null".
>>> Is there any way to obtain the charset of the webpage?
>>> When we open this page in a browser such as Firefox, it detects the charset, but when we request it with HttpClient or curl, we cannot get the charset?
>>> I think this is a big problem for a crawler: when we crawl a webpage, HttpClient gives us a stream, and we must know the charset of that webpage to save it in the database, but it seems that for some webpages we cannot get the charset.
>>>
>>> Khosro.
>>>
>>>
>>>> ________________________________
>>>> From: Stijn Deknudt
>>>> To: HttpClient User Discussion
>>>> Cc: Khosro Asgharifard Sharabiani
>>>> Sent: Tuesday, August 16, 2011 4:38 PM
>>>> Subject: Re: Obtaining charset of page from HttpResponse.
>>>>
>>>> Hi Khosro,
>>>>
>>>> The Content-Type header is set (correctly) to "text/html", like Jon said.
>>>> There's no header in the response that says anything about the
>>>> character set, but you can obtain this information from the entity
>>>> itself: the HTML contains the character set inside the meta tag.
>>>>
>>>> See also http://www.w3.org/International/O-charset for more
>>>> information about the different ways to declare the character
>>>> encoding.
>>>>
>>>> Kind regards,
>>>> Stijn Deknudt.
>>>>
>>>> On 8/16/11, Jon Moore wrote:
>>>>> Hi,
>>>>>
>>>>> This is because the resource at www.annahar.com that you link to
>>>>> returns a Content-Type header that just reads "text/html":
>>>>>
>>>>> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" > /dev/null
>>>>> * About to connect() to www.annahar.com port 80 (#0)
>>>>> *   Trying 66.242.155.235... connected
>>>>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>>>>> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>>>>> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
>>>>> > Host: www.annahar.com
>>>>> > Accept: */*
>>>>> >
>>>>> < HTTP/1.1 200 OK
>>>>> < Connection: close
>>>>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>>>>> < Server: Microsoft-IIS/6.0
>>>>> < X-Powered-By: ASP.NET
>>>>> < X-Powered-By: PHP/5.2.0
>>>>> < Content-type: text/html
>>>>> <
>>>>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>>>>                                  Dload  Upload   Total   Spent    Left  Speed
>>>>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{ [data not shown]
>>>>> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k* Closing connection #0
>>>>>
>>>>> So httpclient is doing the right thing -- it's giving you access to
>>>>> exactly what's in the header that's returned.
>>>>>
>>>>> Jon
>>>>>
>>>>>
>>>>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani wrote:
>>>>>> Hello,
>>>>>> I use the following code to find the charset of a page, but it does not work
>>>>>> for the page
>>>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>>>
>>>>>> Code:
>>>>>> [code]
>>>>>>
>>>>>> try {
>>>>>>     HttpClient httpclient = new DefaultHttpClient();
>>>>>>     String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>>>>     HttpGet httpget = new HttpGet(url);
>>>>>>     HttpResponse response;
>>>>>>     response = httpclient.execute(httpget);
>>>>>>     HttpEntity entity = response.getEntity();
>>>>>>     if (entity != null) {
>>>>>>         Header[] allHeaders = response.getHeaders("Content-Type");
>>>>>>         System.out.println(allHeaders[0].getValue());
>>>>>>     }
>>>>>> } catch (ClientProtocolException e) {
>>>>>>     e.printStackTrace();
>>>>>> } catch (IOException e) {
>>>>>>     e.printStackTrace();
>>>>>> }
>>>>>> [/code]
>>>>>>
>>>>>>
>>>>>> The output of the above code is: text/html.
>>>>>> But I think the output should be "text/html; charset=windows-1256". Am I
>>>>>> right?
>>>>>>
>>>>>> But when I use
>>>>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>>>>> as the URL in the code, it returns "text/html; charset=UTF-8", which I think
>>>>>> is OK.
>>>>>> It seems that it works for some pages but not all of them. Why does this happen?
>>>>>>
>>>>>>
>>>>>> Khosro.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
>>>>> For additional commands, e-mail: httpclient-users-help@hc.apache.org
>>>>>
>>>>
>>>> --
>>>> Stijn
>>>> stijn@ebisi.be
>>>>
>>
>
>--------------------------
>Ken Krugler
>+1 530-210-6378
>http://bixolabs.com
>custom data mining solutions
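
For reference, the header-then-meta-tag lookup that Jon and Stijn describe above could be sketched roughly as follows. This is only a minimal illustration using the plain JDK, not anything HttpClient provides: the class name CharsetSniffer, its method names, and the 4096-byte scan limit are all assumptions made up for this example. It does not attempt the statistical detection Ken mentions (that is what a library such as Tika is for).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of HttpClient): looks for a charset in the
// Content-Type header first, then in <meta> tags near the top of the HTML.
final class CharsetSniffer {
    // Matches charset=NAME, optionally quoted, in a header value or a meta tag.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s\"';>]+)", Pattern.CASE_INSENSITIVE);
    // Matches a whole <meta ...> tag.
    private static final Pattern META_TAG = Pattern.compile(
            "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Assumption: charset declarations appear within the first 4096 bytes.
    private static final int SCAN_LIMIT = 4096;

    // Returns the charset name from a Content-Type value, or null if absent.
    public static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Returns the charset name declared in a <meta> tag, or null if none is found.
    public static String fromHtmlPrefix(byte[] body) {
        int len = Math.min(body.length, SCAN_LIMIT);
        // ISO-8859-1 maps every byte to a char, so ASCII markup survives decoding.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher meta = META_TAG.matcher(head);
        while (meta.find()) {
            Matcher m = CHARSET_PARAM.matcher(meta.group());
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    // Header first, then <meta>, then the caller-supplied fallback.
    public static Charset detect(String contentType, byte[] body, Charset fallback) {
        String name = fromContentType(contentType);
        if (name == null) {
            name = fromHtmlPrefix(body);
        }
        if (name != null) {
            try {
                return Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the fallback.
            }
        }
        return fallback;
    }
}
```

With HttpClient 4.x, the header value would presumably come from response.getFirstHeader("Content-Type") and the bytes from EntityUtils.toByteArray(entity); the fallback charset is the caller's choice, and a page with neither declaration still needs a detector such as Tika.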