commons-dev mailing list archives

From "Sung-Gu" <jeri...@apache.org>
Subject Re: [HttpClient]Encoding
Date Thu, 21 Mar 2002 18:18:19 GMT

I'm sure you guys are talking about the character set (which in MIME means the character encoding) in HTTP.
 I added my comments below.  ;)


Sung-Gu

----- Original Message ----- 
Subject: Re: [HttpClient]Encoding


> I'll see about changing getResponseBodyAsString() to use the charset from
> the content-type (if it exists).  I'm up to my ears with day job work right
> now, so it'll probably be a while before I can get to it.
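
The change described above could look roughly like the following. This is only a sketch, not HttpClient code: the class ContentTypeCharset and its methods are hypothetical names, and falling back to ISO-8859-1 when no charset is given is an assumption based on the ISO-LATIN-1 default discussed in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of HttpClient.
final class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, or the fallback if none is usable. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The parameter value may be a quoted string.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Decodes a body with the header charset; ISO-8859-1 is the assumed default. */
    static String decodeBody(byte[] body, String contentType) {
        Charset cs = fromContentType(contentType, StandardCharsets.ISO_8859_1);
        return new String(body, cs);
    }
}
```

A caller would pass the raw bytes from getResponseBody() together with the value of the Content-Type header. As Marc notes, this only helps when the server actually sends a charset parameter.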

I think we'll need to support language tags (within the Accept-Language and Content-Language
fields) and Accept and Content-Type (for internet media types) at some point.

> 
> People still need to understand (and I'll improve the JavaDoc) that
> getResponseBodyAsString() is never really going to be all that useful in the
> real world.  From HttpClient's perspective the response body is simply a
> sequence of bytes, nothing more.  It is up to a higher application layer to
> actually *interpret* those bytes based on the mime type specified in the
> content-type header.
> 
> Marc Saegesser 
> 
> > -----Original Message-----
> > From: Rapheal Kaplan [mailto:rafe@mimir.net]
> > Sent: Wednesday, March 20, 2002 1:53 PM
> > To: Jakarta Commons Developers List
> > Subject: Re: [HttpClient]Encoding
> > 
> > 
> >   Makes sense to me.  Because the encoding is handled in the 
> > body itself, it 
> > doesn't necessarily help that much to set the encoding in the 
> > getResponseBodyAsString method.  Also, this kind of means 
> > that you can't rely 
> > on the getResponseBodyAsString method for all purposes.  
> > There needs to be 
> > some other layer of a client application that manages encoding.
> > 
> >   I still see the use of get...AsString, of course.  It could 
> > be an in-between 
> > step that is sent to a parser to determine actual encoding, 
> > but then you 
> > would need to return to the original byte stream anyway to 
> > re-string the 
> > body.  Maybe the documentation should reflect this information.
> > 
> >   Also, if people start using charset info in the future, it 
> > would probably 
> > be nice to provide support.  It might be that doing body to 
> > string conversion 
> > should be somewhere else in the API.  Any ideas?
> >
> >   My first guess would be to have a utility class that can do 
> > the correct 
> > encoding, from both the header and maybe even parsing the 
> > content.  However, 
> > I don't think I am familiar enough with the API to say decisively.
> > 
> >   I do know that such features might be very useful for some work 
> > that I need to do in the near future.  I am working on 
> > software that needs 
> > to interact with several languages with non-latin character sets.

In your previous mail, you wrote:
> For example, if the client is requesting a document written 
> in Chinese, it 
> could well use an entirely different encoding.

If you want to solve this problem purely from the perspective of character encoding,
you should consider the conversion between the local character set and the transfer character
set on both the client and the server side.

It can get more complicated!
If you mix non-ASCII scripts (Korean and Chinese...), you also have to handle
bi-directional display for these character sets.  Then you need a two-step process
to convert between the local character set and UTF-8: first, convert the local character
set to UCS; second, convert UCS to UTF-8.  How complicated, huh?
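
For what it's worth, the two-step conversion can be sketched in Java, where a String already holds Unicode text (as UTF-16 code units) and so plays the role of the UCS step. The class and method names are made up for illustration, and the caller is assumed to know the local charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Transcode {

    /** Converts bytes in a local charset to UTF-8 bytes for transfer. */
    static byte[] toUtf8(byte[] localBytes, Charset localCharset) {
        // Step 1: local character set -> Unicode (a Java String).
        String unicode = new String(localBytes, localCharset);
        // Step 2: Unicode -> UTF-8.
        return unicode.getBytes(StandardCharsets.UTF_8);
    }

    /** The reverse direction: UTF-8 bytes back to the local charset. */
    static byte[] fromUtf8(byte[] utf8Bytes, Charset localCharset) {
        String unicode = new String(utf8Bytes, StandardCharsets.UTF_8);
        // Characters the local charset cannot represent are replaced by
        // that charset's default replacement byte sequence.
        return unicode.getBytes(localCharset);
    }
}
```

For Korean text the local charset might be obtained with Charset.forName("EUC-KR"), provided the runtime ships that charset.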

And one more thing!
Some old clients or servers don't support an 8-bit transfer encoding like UTF-8. Then what?
 We would have to check whether the bytes are valid UTF-8 or not.
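
One way to do that check in Java is a strict decoder that reports malformed input instead of silently replacing it. This is a minimal sketch; Utf8Check is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Utf8Check {

    /** Returns true if the whole byte array is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw on bad input rather than
        // substituting the replacement character.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```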


However, there is an easier way to solve this problem.
( I WANT to say this a bit!  ^^ )
That is to use an "escaped encoding" that uses the ASCII character set only.
It looks like the application/x-www-form-urlencoded media type used in HTML,
but it's somewhat different.
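
To illustrate the idea, here is a sketch under the assumption that "escaped encoding" means percent-escaping the UTF-8 bytes of the text, URI style. The class name and the choice of which characters stay unescaped are assumptions for this example, not something defined in this thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class EscapedEncoding {

    // Characters left as-is (assumed set: ASCII letters, digits, and "-_.~").
    private static final String UNRESERVED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    /** Returns an ASCII-only form of the text: every other byte becomes %XX. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v < 0x80 && UNRESERVED.indexOf((char) v) >= 0) {
                out.append((char) v);
            } else {
                out.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0F]);
            }
        }
        return out.toString();
    }
}
```

One visible difference from application/x-www-form-urlencoded in this sketch is that a space becomes %20 rather than "+".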

> > 
> >   - Rapheal Kaplan
> > 
> > 
> > 
> > On Wednesday 20 March 2002 14:27, you wrote:
> > > I've had to deal with this problem myself.  Right now the 
> > only solution is
> > > to use getResponseBody() and convert bytes into a string using the
> > > appropriate encoding.  I like the idea of having 
> > getResponseBodyAsString()
> > > use the encoding specified in the Content-Type header, but 
> > the problem is
> > > that it still won't be very useful.
> > >
> > > The vast majority of web servers out there don't include a 
> > "; charset="
> > > attribute in the content-type header or provide a 
> > reasonable mechanism for
> > > content authors to cause the server to set the attribute 
> > correctly on a
> > > per-file basis.  Most pages with non-ISO-LATIN-1 charsets use <META
> > > HTTP-EQUIV> tag in the HTML header to specify the page 
> > encoding.  That
> > > means you still have to read at least part of the response body (as
> > > ISO-LATIN-1) in order to determine the correct encoding.
> > >
> > > I don't have a problem with changing 
> > getResponseBodyAsString() to check the
> > > content-type header, I just doubt that doing that will make 
> > it much more
> > > useful in the real world.
> > >
> > > What do others think?
> > >
> > > Marc Saegesser
> > >
> > 
> 
> 
> 
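
Marc's point about META HTTP-EQUIV (read the start of the body as ISO-LATIN-1, find the declared charset, then decode the original bytes again) can be sketched as follows. The class name, the 4096-byte sniff limit, and the deliberately loose regular expression are all assumptions for illustration; a real HTML parser would be more robust.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class MetaCharsetSniffer {

    // Loosely matches charset=NAME inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Assumed limit on how much of the body to inspect.
    private static final int SNIFF_LIMIT = 4096;

    /** Returns the charset declared in a META tag, or the fallback. */
    static Charset sniff(byte[] body, Charset fallback) {
        int len = Math.min(body.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so this never fails.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: use the fallback.
                return fallback;
            }
        }
        return fallback;
    }

    /** Decodes the original bytes again using the sniffed charset. */
    static String decode(byte[] body) {
        return new String(body, sniff(body, StandardCharsets.ISO_8859_1));
    }
}
```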