hc-httpclient-users mailing list archives

From Oleg Kalnichevski <ol...@apache.org>
Subject Re: Charset trouble, questionmarks
Date Wed, 02 Sep 2009 11:45:42 GMT
On Wed, Sep 02, 2009 at 10:22:16AM +0200, Magnus Olstad Hansen wrote:
> Hello,
>
> I'm using HttpClient 4.0 to download a webpage the same way as shown in  
> one of the examples. This is my method to return a webpage as a string:
>
>        protected static String leechUrl(String url) throws IOException {
>                HttpClient httpclient = new DefaultHttpClient();
>                HttpGet httpget = new HttpGet(url);
>
>                System.out.println("executing request " + httpget.getURI());
>
>                // Create a response handler
>                ResponseHandler<String> responseHandler = new BasicResponseHandler();
>                String responseBody = httpclient.execute(httpget, responseHandler);
>
>                // When HttpClient instance is no longer needed,
>                // shut down the connection manager to ensure
>                // immediate deallocation of all system resources
>                httpclient.getConnectionManager().shutdown();
>                return responseBody;
>        }
>
> However, the responseBody returned here contains ? (question marks) for
> all Norwegian characters (æøåÆØÅ) on the page. For example, if I try to
> dump "http://www.vg.no" I can find the following at line 107:
>
>        <li><a  
> href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue  
> *p?* veggen (blogg)</a></li>
>
> ...that question mark there should've been the character å. To be
> certain, I've compared it to the same page and line dumped with wget:
>
>        <li><a  
> href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue  
> på veggen (blogg)</a></li>
>
> My question is simply: what do I need to do to keep the Norwegian letters
> intact? So far I've tried:
> - Copying BasicResponseHandler and checking in a debugger that
> EntityUtils.getContentCharset() finds a reasonable charset; it does.
> - Hacking EntityUtils.toString() to override both the detected and the
> default charset with "ISO-8859-1" and "UTF-8".
> - Adding a header to the request with content-type and charset (which
> isn't really logical to add to a request, but I tried anyway)
>
> All I've accomplished with this is to sometimes get two ?'s instead of
> one for the Norwegian letters. I also tried to dump the response as
> directly as I could, by using EntityUtils.toByteArray() and
> writing the bytes straight to a file. To my surprise, the ?'s are
> still there, and a hexdump shows that they are all 3F
> (question mark), so it is in fact impossible to recover the Norwegian
> letters. They must have been replaced with a question mark somewhere.
>
> Please advise, and a thousand thanks for reading my problem!
>
> Regards,
> Magnus
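
For reference, the hexdump observation above can be reproduced in isolation. The sketch below is only an illustration, not a diagnosis of this case: it shows the bytes Java produces for "på" under three charsets, and that a literal 3F appears when the character is encoded with a charset that cannot represent it. The class name QuestionMarkBytes and the helper hex() are made up for the example, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkBytes {

    // Encodes the text with the given charset and returns the bytes as
    // space-separated upper-case hex, similar to one line of a hexdump.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "p\u00e5"; // "på", written with an escape to avoid source-encoding issues

        // å is a single byte in ISO-8859-1 and two bytes in UTF-8.
        System.out.println("ISO-8859-1: " + hex(text, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + hex(text, StandardCharsets.UTF_8));

        // US-ASCII cannot represent å, so the encoder substitutes '?' (0x3F).
        System.out.println("US-ASCII:   " + hex(text, StandardCharsets.US_ASCII));
    }
}
```

If the raw response bytes already contain 3F where å is expected, the substitution happened before decoding on the client side, which is why overriding the charset in EntityUtils.toString() cannot bring the letter back.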

Hi Magnus,

Please use this guide to generate a wire / context log of the HTTP session and
post it to this list.

http://hc.apache.org/httpcomponents-client/logging.html
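
As a rough sketch of what that guide describes, wire and context logging can be switched on through Commons Logging system properties, set before any HttpClient class is used. The property names below are given as I recall them from the guide for HttpClient 4.x with the SimpleLog backend, so please check them against the page above; the class name WireLogConfig is made up for this example.

```java
public class WireLogConfig {

    // Assumed logger category for HttpClient 4.x; verify against the logging guide.
    static final String HTTP_LOG_PROPERTY =
            "org.apache.commons.logging.simplelog.log.org.apache.http";

    // Sets the system properties for full wire + context logging.
    // Call this before the first HttpClient instance is created.
    public static void enableFullWireLogging() {
        // Route Commons Logging to its built-in SimpleLog implementation.
        System.setProperty("org.apache.commons.logging.Log",
                "org.apache.commons.logging.impl.SimpleLog");
        // Prefix each log line with a timestamp.
        System.setProperty("org.apache.commons.logging.simplelog.showdatetime", "true");
        // DEBUG on org.apache.http covers both the context and the wire loggers.
        System.setProperty(HTTP_LOG_PROPERTY, "DEBUG");
    }

    public static void main(String[] args) {
        enableFullWireLogging();
        // Create the DefaultHttpClient and execute the request after this point.
        System.out.println(HTTP_LOG_PROPERTY + "=" + System.getProperty(HTTP_LOG_PROPERTY));
    }
}
```

The same properties can be passed on the command line as -D options instead. The wire log should show the exact bytes received for the Norwegian letters, which tells whether the 3F is already present in the server's response.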

Oleg



