hc-httpclient-users mailing list archives

From Oleg Kalnichevski <ol...@apache.org>
Subject Re: Charset trouble, questionmarks
Date Sat, 05 Sep 2009 10:57:40 GMT
MaGGE wrote:
> Hello again Ken,
> 
> Sorry to lag behind on the replies - work is busy these days... :)
> 
> Seems you're right. I've made a custom ResponseHandler class to be able to
> dump the raw output from HttpClient. However, I'd used
> FileWriter/BufferedWriter to dump to my file. This must've tried to
> interpret charset also, causing the bothersome 0x3F's mentioned before. 
> 
> Your tip about another HttpClient app returning the content successfully
> caused me to look at my method again - and I made the output via
> FileOutputStream.write(byte[]) instead. Using hexdump as before I can now
> confirm that there's no longer a 0x3F but 0xC3 0xA5 as it should be.
> 
> (...from wget)
> # hexdump -s 0x1845 -C index.html | head -n 2
> 00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
> 00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|
> 
> (...from my dump)
> # hexdump -s 0x1845 -C raw.txt | head -n 2
> 00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
> 00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|
> 
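A minimal sketch of the difference described above, using in-memory streams in place of files. It assumes the dump went through a Writer whose charset could not represent U+00E5; US-ASCII stands in here for whatever the platform default was, which the thread does not state. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RawDumpSketch {
    // Writes text through a Writer, which re-encodes every char with the given charset.
    static byte[] dumpViaWriter(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Writes the bytes untouched, as FileOutputStream.write(byte[]) would.
    static byte[] dumpRaw(byte[] body) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        sink.write(body, 0, body.length);
        return sink.toByteArray();
    }

    // Formats bytes as space-separated lowercase hex, similar to hexdump.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] body = {0x70, (byte) 0xc3, (byte) 0xa5}; // "p" followed by UTF-8 for U+00E5
        String text = new String(body, StandardCharsets.UTF_8);
        // The Writer cannot map U+00E5 to US-ASCII and substitutes '?' (0x3f).
        System.out.println(hex(dumpViaWriter(text, StandardCharsets.US_ASCII))); // 70 3f
        // The raw path never interprets the bytes.
        System.out.println(hex(dumpRaw(body)));                                  // 70 c3 a5
    }
}
```

For diagnosing charset problems, the raw path is the safer one because nothing between the socket and the file gets a chance to re-encode the data.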
> What remains a mystery to me is, however, why the string returned from
> HttpClient.execute() and thus EntityUtils.toString(Entity) does not
> represent the letters correctly. I also tested this with the
> BasicResponseHandler to be sure it was nothing I'd done.
> 
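One possible explanation, which this thread does not confirm, is that the response declared no charset and the body was therefore decoded with a single-byte fallback such as ISO-8859-1. The sketch below only illustrates what that kind of mismatch does to the same three bytes; it uses the JDK alone, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeSketch {
    // Decodes a response body with the given charset, as any toString helper must.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    // Lists the chars of a string as U+XXXX so the result does not depend on the console encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("U+%04X ", (int) s.charAt(i)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] body = {0x70, (byte) 0xc3, (byte) 0xa5};
        // Correct charset: two bytes become one character.
        System.out.println(codePoints(decode(body, StandardCharsets.UTF_8)));      // U+0070 U+00E5
        // Wrong charset: each byte becomes its own character.
        System.out.println(codePoints(decode(body, StandardCharsets.ISO_8859_1))); // U+0070 U+00C3 U+00A5
    }
}
```

If that is the cause, passing an explicit default charset to EntityUtils.toString, or decoding the intact byte array yourself, should avoid it.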
> At least now I can use my custom ResponseHandler and figure out how to treat
> the intact byte-array correctly. So thanks a lot! :)
> 
> 

If you had only listened and produced a wire / context log when I asked 
you to, all this could have been found out much earlier, and I most likely 
would also have been able to tell why EntityUtils#toString failed to 
detect the charset.

Oleg
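
For reference, a rough sketch of one way to turn on the wire / context log mentioned above. It assumes HttpClient 4.x logging through Commons Logging with the SimpleLog backend; with log4j or another backend the same logger names would be configured there instead. The properties must be set before any HttpClient class is loaded, and the class name is hypothetical.

```java
class WireLogSketch {
    // Sets the system properties that select SimpleLog and raise the
    // org.apache.http loggers (context) and org.apache.http.wire (wire) to DEBUG.
    static void enableWireAndContextLog() {
        System.setProperty("org.apache.commons.logging.Log",
                "org.apache.commons.logging.impl.SimpleLog");
        System.setProperty("org.apache.commons.logging.simplelog.showdatetime", "true");
        System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http", "DEBUG");
        System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http.wire", "DEBUG");
    }

    public static void main(String[] args) {
        enableWireAndContextLog();
        // Create and use the HttpClient only after this point.
        System.out.println(System.getProperty(
                "org.apache.commons.logging.simplelog.log.org.apache.http.wire"));
    }
}
```

The same settings can be passed as -D options on the java command line, which avoids any ordering concerns inside the application.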



> 
> Ken Krugler wrote:
>> Hi Magnus,
>>
>> I used curl to grab the file, and the bytes at 0x1845...0x1847 are  
>> 0xC3 0xA5, which is valid UTF-8 for the U+00E5 code point (latin small  
>> letter a with ring above).
>>
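The byte pair can be checked by hand: code points from U+0080 to U+07FF are encoded in UTF-8 as 110xxxxx 10xxxxxx. A small sketch, with hypothetical names, that applies that layout to U+00E5 and compares the result with the JDK's own encoder:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8TwoByteSketch {
    // Encodes a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx.
    static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7ff) {
            throw new IllegalArgumentException("not a two-byte UTF-8 code point");
        }
        byte lead = (byte) (0xc0 | (codePoint >> 6));    // upper 5 bits
        byte trail = (byte) (0x80 | (codePoint & 0x3f)); // lower 6 bits
        return new byte[] {lead, trail};
    }

    public static void main(String[] args) {
        byte[] manual = encodeTwoByte(0xe5);
        byte[] library = "\u00e5".getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02x %02x%n", manual[0] & 0xff, manual[1] & 0xff); // c3 a5
        System.out.println(Arrays.equals(manual, library));                  // true
    }
}
```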
>> I also used Bixo (http://bixo.101tec.com) to crawl the same page, and  
>> wound up with the same raw data. Bixo uses HttpClient 4.0, so it's a  
>> good test.
>>
>> Given what you've tried (in your initial email), I've only got one  
>> weak guess - that your tools are showing you stuff that isn't actually  
>> there.
>>
> 



