hc-httpclient-users mailing list archives

From 心藍 Dennis <hkdenni...@gmail.com>
Subject Re: Japanese charset?
Date Thu, 16 Jun 2005 07:47:13 GMT
In fact, getResponseCharSet() does not give the right charset when the HTTP
header (not the HTML header) doesn't specify one. It does not read the file content.

I think it should return null or throw an exception if there is no charset in the
HTTP Content-Type, but it returns ISO-8859-1.

My approach is to always check whether the response's Content-Type header actually
contains the charset that getResponseCharSet() returns. If not, I load the HTML
into a byte[] and then use jchardet or analyze the HTML header myself.
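
The check described above can be sketched roughly as follows. This is only an illustration, not HttpClient code: the class name CharsetSniffer and its methods are made up for this example, and the statistical detection step (jchardet) is left out. It assumes the caller passes in the raw Content-Type header value and the body bytes, and that the HTML head is ASCII-compatible (true for Shift_JIS pages, not for UTF-16).

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
final class CharsetSniffer {
    // Matches charset=NAME, with optional quotes around NAME, case-insensitively.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if none is given. */
    public static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Returns the first charset=NAME found in the first 4 KB of the body,
     * which is where a meta tag normally sits, or null if there is none.
     */
    public static String fromHtmlMeta(byte[] body) {
        int len = Math.min(body.length, 4096);
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup survives.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    /** HTTP header first, then the HTML head, then the caller's fallback. */
    public static String detect(String contentType, byte[] body, String fallback) {
        String cs = fromContentType(contentType);
        if (cs == null) {
            cs = fromHtmlMeta(body);
        }
        return cs != null ? cs : fallback;
    }
}
```

With HttpClient, the inputs would come from the Content-Type response header and getResponseBody(); a detector such as jchardet could be tried before the final fallback.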


On 6/16/05, Roland Weber <ROLWEBER@de.ibm.com> wrote:
> 
> Hello Andrew,
> 
> sorry that my mail yesterday took 9 hours to get to the list.
> I hope this one appears in a timely manner :-)
> 
> 
> "Andrew A. Sabitov" <sabitov@catalysis.nsk.su> wrote on 16.06.2005
> 03:04:05:
> 
> > Server sends Shift_JIS as page charset.
> >
> > it's my code now:
> >
> > ............
> > result = new HttpResponse ( method.getResponseBodyAsStream (),
> > method.getResponseCharSet() );
> > .........
> >
> > //in HttpResponse constructor:
> > HttpResponse ( InputStream responseBodyAsStream, String charset )
> >         throws IOException {
> >     BufferedReader reader = new BufferedReader ( new
> >         InputStreamReader ( responseBodyAsStream, charset ) );
> >     String line = null;
> >     while ( ( line = reader.readLine() ) != null ) {
> >         this.add( line );
> >         out.write( line );
> >         out.write( "\n" );
> >     }
> > }
> >
> > It works. :)
> >
> > It's funny, but
> > http://jakarta.apache.org/commons/httpclient/3.0/charencodings.html
> > says: "If the response is known to be a String, you can use the
> > getResponseBodyAsString method which will automatically use the encoding
> > specified in the Content-Type header or ISO-8859-1 if no charset is
> > specified."
> >
> > Content-Type for this page is "text/html; charset=Shift_JIS", I really
> > thought that httpclient auto-converts the body... :(
> >
> 
> I've checked the code for 3.0. Here are the relevant fragments:
> 
> 
> http://svn.apache.org/repos/asf/jakarta/commons/proper/httpclient/trunk/src/java/org/apache/commons/httpclient/HttpMethodBase.java
> method getResponseBodyAsString:
> byte[] rawdata...
> ... = getResponseBody()
> ...
> return EncodingUtil.getString(rawdata, getResponseCharSet())
> ...
> 
> 
> 
> http://svn.apache.org/repos/asf/jakarta/commons/proper/httpclient/trunk/src/java/org/apache/commons/httpclient/util/EncodingUtil.java
> method getString(byte[],int,int,String):
> 
> ... return new String(data, offset, length, charset)
> ... LOG.warn("Unsupported encoding: " + charset + ". System encoding used");
> return new String(data, offset, length);
> 
> I wonder whether the InputStreamReader recognizes charsets that the String
> constructor doesn't? But why should it? And why wouldn't you get the
> warning?
> Something is fishy here.
> 
> cheers,
> Roland
> 
> 
>
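
To make the two decoding paths from the quoted mails easier to compare, here is a small sketch. It is not the HttpClient source: the class DecodeSketch and both method names are made up, and the logging is reduced to System.err. The first method follows the shape of the getString fragment quoted above (String constructor, platform default on an unsupported name); the second decodes through an InputStreamReader as in Andrew's constructor. Both the String constructor and the InputStreamReader constructor take a charset name and report an unknown one with UnsupportedEncodingException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;

// Hypothetical comparison helper, not part of HttpClient.
final class DecodeSketch {
    /** String-constructor path: falls back to the platform default charset. */
    public static String viaString(byte[] data, String charset) {
        try {
            return new String(data, charset);
        } catch (UnsupportedEncodingException e) {
            System.err.println("Unsupported encoding: " + charset + ". System encoding used");
            return new String(data);
        }
    }

    /** Reader path: no fallback here, an unsupported name surfaces as an exception. */
    public static String viaReader(byte[] data, String charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            // UnsupportedEncodingException is an IOException and ends up here too.
            throw new UncheckedIOException(e);
        }
    }
}
```

Feeding the same bytes and the same charset name to both methods and comparing the results is one way to see whether the two paths really differ for a given page.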
