hc-httpclient-users mailing list archives

From Magnus Olstad Hansen <ma...@magge.no>
Subject Charset trouble, questionmarks
Date Wed, 02 Sep 2009 08:22:16 GMT
Hello,

I'm using HttpClient 4.0 to download a webpage the same way as shown in 
one of the examples. This is my method to return a webpage as a string:

        protected static String leechUrl(String url) throws IOException {
                HttpClient httpclient = new DefaultHttpClient();
                HttpGet httpget = new HttpGet(url);

                System.out.println("executing request " + httpget.getURI());

                // Create a response handler
                ResponseHandler<String> responseHandler = new BasicResponseHandler();
                String responseBody = httpclient.execute(httpget, responseHandler);

                // When HttpClient instance is no longer needed,
                // shut down the connection manager to ensure
                // immediate deallocation of all system resources
                httpclient.getConnectionManager().shutdown();
                return responseBody;
        }

However, the responseBody returned here contains ? (question marks) for 
all Norwegian characters (æøåÆØÅ) on the page. For example, if I dump 
"http://www.vg.no" I find the following at line 107:

        <li><a href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue *p?* veggen (blogg)</a></li>

...that question mark should have been the character å. To be certain, 
I compared it with the same line of the same page dumped with wget:

        <li><a href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue på veggen (blogg)</a></li>

My question is simply: what do I need to do to keep the Norwegian 
letters intact? So far I've tried:
- Copying BasicResponseHandler and debugging it to check that 
EntityUtils.getContentCharset() finds a reasonable charset; it does.
- Hacking EntityUtils.toString() to override both the detected and the 
default charset with "ISO-8859-1" and "UTF-8".
- Adding a Content-Type header with a charset to the request (which 
isn't really logical for a request, but I tried anyway).
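
To illustrate the second item: forcing a charset boils down to decoding 
the raw body bytes with a fixed Charset. The sketch below uses only the 
JDK and is not the real EntityUtils code; the class 
CharsetOverrideSketch and its decode method are hypothetical names, the 
sample bytes are made up, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpClient: decode raw body bytes
// with a forced charset instead of the detected or default one.
final class CharsetOverrideSketch {

    static String decode(byte[] body, Charset forced) {
        return new String(body, forced);
    }

    public static void main(String[] args) {
        // "p" + U+00E5 as ISO-8859-1 bytes: 0x70, 0xE5.
        byte[] latin1 = {0x70, (byte) 0xE5};
        // The same text as UTF-8 bytes: U+00E5 becomes 0xC3 0xA5.
        byte[] utf8 = {0x70, (byte) 0xC3, (byte) 0xA5};

        // Matching charset: both decode to the same two-character string.
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1));
        System.out.println(decode(utf8, StandardCharsets.UTF_8));

        // Mismatched charset: 0xE5 alone is not valid UTF-8, so the
        // decoder substitutes the replacement character U+FFFD.
        System.out.println(decode(latin1, StandardCharsets.UTF_8));
    }
}
```

Overriding the charset can only help if the original bytes are still 
present in the response; it cannot bring back a character that was 
already replaced before decoding.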

All I've accomplished with this is to sometimes get two ?'s instead of 
one for the Norwegian letters. I also tried to dump the response as 
directly as I could by using EntityUtils.toByteArray() and writing the 
bytes straight to a file. To my surprise, the ?'s are still there, and 
a hexdump shows that they are all 3F (question mark), so it is in fact 
impossible to recover the Norwegian letters. They must have been 
replaced with a question mark somewhere.
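
For reference, the hexdump check can be reproduced with a few lines of 
plain Java. This is only a sketch: HexDumpSketch and toHex are 
hypothetical names, the sample string is made up, and StandardCharsets 
assumes Java 7 or later. It shows which bytes an intact å would produce 
in ISO-8859-1 and UTF-8, and that encoding to US-ASCII yields exactly 
the 3F seen in the dump.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: render bytes as space-separated uppercase hex.
final class HexDumpSketch {

    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "p\u00E5"; // "p" followed by U+00E5

        // Intact character, one byte in ISO-8859-1.
        System.out.println(toHex(text.getBytes(StandardCharsets.ISO_8859_1))); // 70 E5
        // Intact character, two bytes in UTF-8.
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_8)));      // 70 C3 A5
        // US-ASCII cannot represent U+00E5, so getBytes substitutes '?'.
        System.out.println(toHex(text.getBytes(StandardCharsets.US_ASCII)));   // 70 3F
    }
}
```

Once a byte is 3F, nothing distinguishes it from a literal question 
mark, which is why no choice of charset on the decoding side helps.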

Please advise, and a thousand thanks for reading my problem!

Regards,
Magnus
