From: MaGGE
To: httpclient-users@hc.apache.org
Subject: Re: Charset trouble, questionmarks
Date: Sat, 5 Sep 2009 03:37:44 -0700 (PDT)
Message-ID: <25307019.post@talk.nabble.com>

Hello again Ken,

Sorry to lag behind on the replies - work is busy
these days... :)

Seems you're right. I've made a custom ResponseHandler class to be able to dump the raw output from HttpClient. However, I had used FileWriter/BufferedWriter to dump to my file. That must have applied a charset conversion of its own, causing the bothersome 0x3F's mentioned before.

Your tip about another HttpClient app returning the content successfully caused me to look at my method again, and I changed the output to go via FileOutputStream.write(byte[]) instead. Using hexdump as before, I can now confirm that there's no longer a 0x3F but 0xC3 0xA5, as it should be.

(...from wget)
# hexdump -s 0x1845 -C index.html | head -n 2
00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|

> Hi Magnus,
>
> I used curl to grab the file, and the bytes at 0x1845...0x1847 are
> 0xC3 0xA5, which is valid UTF-8 for the \u00E5 code point (latin small
> letter a with ring above).
>
> I also used Bixo (http://bixo.101tec.com) to crawl the same page, and
> wound up with the same raw data. Bixo uses HttpClient 4.0, so it's a
> good test.
>
> Given what you've tried (in your initial email), I've only got one
> weak guess - that your tools are showing you stuff that isn't actually
> there.
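
The Writer-versus-raw-stream difference described in this message can be sketched in Java roughly as follows. This is only an illustration, not the ResponseHandler from the thread: the class RawDumpSketch and its methods are hypothetical, an in-memory ByteArrayOutputStream stands in for the file, and the Writer charset is fixed to US-ASCII as an assumption (FileWriter at the time used the platform default charset, which may differ from system to system).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: compares a byte-for-byte dump with a dump that
// passes through a Writer and therefore through a charset encoder.
class RawDumpSketch {

    // Copies the bytes untouched, as FileOutputStream.write(byte[]) would.
    static byte[] dumpRaw(byte[] body) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(body);
        return out.toByteArray();
    }

    // Decodes the body as UTF-8, then re-encodes it through a Writer.
    // Characters the Writer's charset cannot represent are replaced by
    // the encoder's replacement byte, which is '?' (0x3F) for US-ASCII.
    static byte[] dumpViaWriter(byte[] body, Charset writerCharset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, writerCharset)) {
            writer.write(new String(body, StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    // Formats bytes as space-separated lowercase hex, similar to hexdump -C.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // First four bytes of the hexdump line quoted in this message.
        byte[] body = {0x70, (byte) 0xC3, (byte) 0xA5, 0x20};

        System.out.println(hex(dumpRaw(body)));                                  // 70 c3 a5 20
        System.out.println(hex(dumpViaWriter(body, StandardCharsets.US_ASCII))); // 70 3f 20
    }
}
```

The raw dump keeps 0xC3 0xA5 intact, while the Writer path collapses the two bytes into a single 0x3F. That is why a byte stream is the safer choice when the goal is to inspect exactly what the server sent.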