commons-dev mailing list archives

From Marc Saegesser <Marc.Saeges...@apropos.com>
Subject RE: [HttpClient]Encoding
Date Wed, 20 Mar 2002 19:27:48 GMT
I've had to deal with this problem myself.  Right now the only solution is
to use getResponseBody() and convert bytes into a string using the
appropriate encoding.  I like the idea of having getResponseBodyAsString()
use the encoding specified in the Content-Type header, but the problem is
that it still won't be very useful.
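
A minimal sketch of that workaround, assuming the caller already knows the right charset name. The class and method names here are made up for illustration and are not part of HttpClient; the byte array stands in for whatever getResponseBody() returns.

```java
import java.nio.charset.Charset;

class ResponseBodyDecoder {
    /**
     * Decodes a raw response body with an explicitly named charset.
     * Returns null when there is no body, mirroring getResponseBodyAsString().
     * Charset.forName throws an unchecked exception for unknown names.
     */
    static String decode(byte[] body, String charsetName) {
        if (body == null) {
            return null;
        }
        return new String(body, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Stand-in for getResponseBody(): U+00E9 (e with acute accent) encoded as UTF-8.
        byte[] body = { (byte) 0xC3, (byte) 0xA9 };
        System.out.println(decode(body, "UTF-8").length());      // prints 1
        System.out.println(decode(body, "ISO-8859-1").length()); // prints 2
    }
}
```

The same two bytes come out as one character or two depending on the charset, which is why relying on the VM default is a gamble.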

The vast majority of web servers out there don't include a "; charset="
parameter in the Content-Type header, nor do they provide a reasonable
mechanism for content authors to set it correctly on a per-file basis.
Most pages with non-ISO-LATIN-1 charsets use a <META HTTP-EQUIV> tag in
the HTML header to specify the page encoding.  That means you still have
to read at least part of the response body (as ISO-LATIN-1) in order to
determine the correct encoding.
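
A rough sketch of that two-pass read: decode the start of the body as ISO-8859-1, look for a charset declaration in a META tag, then decode the whole body again. Everything here (class name, regex, the 4096-byte limit) is an assumption for illustration; a real HTML parser handles many cases this pattern does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSniffer {
    // Matches charset=NAME inside a <meta ...> tag, e.g.
    // <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Big5">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9._:-]*)",
            Pattern.CASE_INSENSITIVE);

    // Assumption: the declaration appears near the top of the document.
    private static final int SNIFF_LIMIT = 4096;

    /** Returns the declared charset name, or null if none is found. */
    static String sniffCharset(byte[] body) {
        int len = Math.min(body.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to a character, so this never fails.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    /** Decodes with the sniffed charset, falling back to ISO-8859-1. */
    static String decode(byte[] body) {
        String name = sniffCharset(body);
        Charset cs = (name != null && Charset.isSupported(name))
                ? Charset.forName(name)
                : StandardCharsets.ISO_8859_1;
        return new String(body, cs);
    }

    public static void main(String[] args) {
        byte[] body = ("<html><head><META HTTP-EQUIV=\"Content-Type\" "
                + "CONTENT=\"text/html; charset=UTF-8\"></head></html>")
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(sniffCharset(body)); // prints UTF-8
    }
}
```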

I don't have a problem with changing getResponseBodyAsString() to check the
Content-Type header; I just doubt that doing so will make it much more
useful in the real world.
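
For reference, the header check itself could look something like the sketch below. The class and method names are hypothetical, and the ISO-8859-1 fallback for a missing or unusable charset parameter is an assumption, not existing HttpClient behavior.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {
    /**
     * Extracts the charset parameter from a Content-Type header value such as
     * "text/html; charset=UTF-8". Falls back to ISO-8859-1 when the header is
     * null, has no charset parameter, or names an unknown charset.
     */
    static Charset fromContentType(String contentType) {
        if (contentType != null) {
            String[] parts = contentType.split(";");
            // parts[0] is the media type; parameters start at index 1.
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).trim();
                    // Strip optional surrounding double quotes.
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=UTF-8")); // prints UTF-8
        System.out.println(fromContentType("text/html"));                // prints ISO-8859-1
    }
}
```

As noted above, this only helps when the server actually sends the parameter.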

What do others think?

Marc Saegesser 

> -----Original Message-----
> From: Rapheal Kaplan [mailto:rafe@mimir.net]
> Sent: Wednesday, March 20, 2002 12:46 PM
> To: commons-dev@jakarta.apache.org
> Subject: [HttpClient]Encoding
> 
> 
>   Was working with a friend trying to determine the best way to read
> the contents of an HTTP response into a string.  Since he's working
> within the Jakarta framework, including the HttpClient, we decided to
> use that API.  The simplest way seems to be:
> 
>   HttpClient hc = new HttpClient();
>   UrlGetMethod gm = new UrlGetMethod(query);
>   hc.startSession(url,80);
>   hc.executeMethod(gm);
> 
>   String htmlText = gm.getResponseBodyAsString();
> 
>   I thought that seemed like a good idea, and wanted to check to make
> sure that the encoding was working correctly in
> getResponseBodyAsString.  I noticed there is also "byte[]
> getResponseBody" and getResponseBodyAsStream.  It doesn't seem like
> getResponseBodyAsString would decode the byte array properly.  Here is
> how it is written in
> org.apache.commons.httpclient.methods.GetMethod.java:
> 
>    /**
>     * Return my response body, if any,
>     * as a {@link String}.
>     * Otherwise return <tt>null</tt>.
>     */
>    public String getResponseBodyAsString() {
>        byte[] data = getResponseBody();
>        if(null == data) {
>            return null;
>        } else {
>            return new String(data);
>        }
>    }
> 
>   The problem is that the string is constructed using the default
> encoding of the VM, but not the encoding that the server might be
> sending the data in.  For example, if the client is requesting a
> document written in Chinese, it could well use an entirely different
> encoding.
> 
>   Of course I am not worried about the getResponseBody and
> getResponseBodyAsStream methods.  Those should expose binary data.
> However, the get...AsString should do something like:
> 
>    /**
>     * Return my response body, if any,
>     * as a {@link String}.
>     * Otherwise return <tt>null</tt>.
>     */
>    public String getResponseBodyAsString() {
>        byte[] data = getResponseBody();
>        if(null == data) {
>            return null;
>        } else {
>            return new String(data, getResponseEncoding());
>        }
>    }
> 
>   Of course I am making up the method getResponseEncoding as an example.
> 
>   Likewise, I would recommend a getResponseAsReader method that would
> return an InputStreamReader set to the proper encoding.
> 
>   Has anyone given this problem any thought?  Or is this design
> intentional and encoding is handled somewhere else?  Are there other
> issues?
> 
>   If there is a desire to solve the encoding problem (assuming I am
> correct in thinking it is missing), I am quite willing to participate
> in the design and encoding.
> 
>   Thank you.
> 
>   - Rapheal Kaplan
> 
> 
> 
> 
> --
> To unsubscribe, e-mail:   
> <mailto:commons-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: 
> <mailto:commons-dev-help@jakarta.apache.org>
> 


