hc-httpclient-users mailing list archives

From MaGGE <mag...@magge.no>
Subject Re: Charset trouble, questionmarks
Date Sat, 05 Sep 2009 10:37:44 GMT

Hello again Ken,

Sorry to lag behind on the replies - work is busy these days... :)

Seems you're right. I'd made a custom ResponseHandler class to be able to
dump the raw output from HttpClient. However, I'd used
FileWriter/BufferedWriter to dump to my file. These writers encode the text
with the platform default charset, which must have caused the bothersome
0x3F's mentioned before.

Your tip about another HttpClient app returning the content successfully
caused me to look at my method again - and I made the output via
FileOutputStream.write(byte[]) instead. Using hexdump as before I can now
confirm that there's no longer a 0x3F but 0xC3 0xA5 as it should be.

(...from wget)
# hexdump -s 0x1845 -C index.html | head -n 2
00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|

(...from my dump)
# hexdump -s 0x1845 -C raw.txt | head -n 2
00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|
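
The difference between the two dump approaches can be sketched as follows.
This is a minimal illustration, not the actual ResponseHandler: the class and
method names are hypothetical, in-memory streams stand in for the files, and
US-ASCII stands in for a platform default charset that cannot represent the
letter (FileWriter uses whatever the platform default happens to be).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HttpClient.
class DumpComparison {

    // Byte-oriented dump: the bytes pass through untouched,
    // as they do with FileOutputStream.write(byte[]).
    static byte[] dumpAsBytes(byte[] body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    // Character-oriented dump: the Writer re-encodes the text. US-ASCII is
    // an assumed stand-in for a default charset that lacks U+00E5; any
    // unmappable character is replaced with '?' (0x3F).
    static byte[] dumpThroughWriter(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.US_ASCII)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Formats bytes the way hexdump prints them.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] raw = {0x70, (byte) 0xC3, (byte) 0xA5};       // "på" in UTF-8
        System.out.println(hex(dumpAsBytes(raw)));            // 70 c3 a5
        System.out.println(hex(dumpThroughWriter("p\u00e5"))); // 70 3f
    }
}
```

With a real file, the byte-oriented path corresponds to
FileOutputStream.write(byte[]), and the character-oriented path to
FileWriter, or to OutputStreamWriter when the charset should be explicit.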

What remains a mystery to me, however, is why the string returned from
HttpClient.execute(), and thus from EntityUtils.toString(HttpEntity), does
not represent the letters correctly. I also tested this with the
BasicResponseHandler to be sure it was nothing I'd done.
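
One possible explanation, not verified against this server's response
headers: when the Content-Type header names no charset,
EntityUtils.toString(HttpEntity) falls back to a default charset, which to the
best of my knowledge is ISO-8859-1 in HttpCore 4.0. Decoding UTF-8 bytes that
way turns each multi-byte character into several wrong ones. The sketch below
uses only the JDK to show the effect on the bytes from the hexdump; the class
and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HttpClient.
class DecodeComparison {

    // Decodes the raw body with the charset the page actually uses.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decodes the raw body as ISO-8859-1, the assumed fallback charset.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Lists the UTF-16 code units of a string as U+XXXX values.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] raw = {0x70, (byte) 0xC3, (byte) 0xA5};    // bytes from the hexdump
        System.out.println(codeUnits(decodeUtf8(raw)));   // U+0070 U+00E5
        System.out.println(codeUnits(decodeLatin1(raw))); // U+0070 U+00C3 U+00A5
    }
}
```

If this is the cause, passing a charset explicitly, for example
EntityUtils.toString(entity, "UTF-8"), or decoding the intact byte array with
new String(bytes, "UTF-8"), should give the expected letters. Note that the
charset argument of EntityUtils.toString is only a default, used when the
response itself declares none.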

At least now I can use my custom ResponseHandler and figure out how to treat
the intact byte-array correctly. So thanks a lot! :)

Ken Krugler wrote:
> Hi Magnus,
> I used curl to grab the file, and the bytes at 0x1845...0x1847 are  
> 0xC3 0xA5, which is valid UTF-8 for the U+00E5 code point (latin small  
> letter a with ring above).
> I also used Bixo (http://bixo.101tec.com) to crawl the same page, and  
> wound up with the same raw data. Bixo uses HttpClient 4.0, so it's a  
> good test.
> Given what you've tried (in your initial email), I've only got one  
> weak guess - that your tools are showing you stuff that isn't actually  
> there.
