tomcat-dev mailing list archives

From "Forrest R. Girouard" <Forrest.Girou...@openwave.com>
Subject Re: [PATCH] '8859_1' is not a valid charset alias
Date Mon, 21 May 2001 01:53:49 GMT
Costin:

I'm not yet familiar with the Tomcat or Jasper code (and I've only
been on this list for a couple of weeks), but in general I concur
with Vince's analysis.  I can corroborate his benchmark results,
since I've seen the same cost contribute to performance problems
under very heavy load with a large number of threads (200).  I'm
baffled why the Java implementors allowed any synchronization in
such a fundamental class as String.
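
A minimal sketch of one way to keep the name lookup off the hot
path: resolve each encoding name once and reuse the result.  This
assumes a JDK that ships java.nio.charset and String(byte[], Charset),
which is newer than anything benchmarked in this thread, so treat it
as an illustration rather than a patch.  CharsetCache, lookup and
decode are hypothetical names, not Tomcat code.

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: not part of Tomcat or Jasper.
final class CharsetCache {
    // Keyed by the name exactly as the caller spelled it, so "8859_1" and
    // "ISO-8859-1" get separate entries.
    private static final Map<String, Charset> CACHE = new ConcurrentHashMap<>();

    private CharsetCache() {}

    // Resolve the name on first use only; later calls return the cached Charset.
    // Charset.forName throws an unchecked exception for unknown or malformed names.
    static Charset lookup(String name) {
        return CACHE.computeIfAbsent(name, Charset::forName);
    }

    // Decode with an already-resolved Charset instead of passing the name
    // string to the String constructor on every call.
    static String decode(byte[] bytes, String name) {
        return new String(bytes, lookup(name));
    }
}
```

A caller would write CharsetCache.decode(bytes, "8859_1") wherever
it currently passes the encoding name to the String constructor.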

Furthermore, it has been my experience that it is necessary to 
internally map between the IANA character set names and the Java 
encoding names (I have no idea why the Java implementors chose to 
use non-standard encoding names).  
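
As a rough illustration of that mapping, assuming a JDK with
java.nio.charset (again, newer than what was tested here): the
canonical name reported by Charset is usually, though not always,
the IANA preferred MIME name, so it can serve as the value sent in
the header.  CharsetNames, mimeName and contentType are hypothetical
names made up for this sketch.

```java
import java.nio.charset.Charset;

// Hypothetical helper: not part of Tomcat or Jasper.
final class CharsetNames {
    private CharsetNames() {}

    // Map any alias the JDK recognizes (for example "8859_1" or "ISO8859_1")
    // to the charset's canonical name.
    // Throws an unchecked exception if the name is unknown or malformed.
    static String mimeName(String alias) {
        return Charset.forName(alias).name();
    }

    // Build a Content-Type value using the canonical name, not the alias.
    static String contentType(String mediaType, String alias) {
        return mediaType + "; charset=" + mimeName(alias);
    }
}
```

With this, CharsetNames.contentType("text/html", "8859_1") yields
"text/html; charset=ISO-8859-1".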

I have also never seen the Content-Encoding in use and rarely have I 
seen the charset specified as part of the Content-Type (I've heard 
that some user-agents and some servers choke on it).   Browsers 
(user agents) almost universally post content in the encoding of the 
original document.

For applications I recommend strictly using a single encoding per 
session and limiting query parameters to US-ASCII.  Of course, 
Tomcat needs to do something reasonable regardless.

Cheers,
	Forrest

Vincent Schonau wrote:
> 
> On Sat, May 19, 2001 at 03:19:09PM -0700, cmanolache@yahoo.com wrote:
> > Vincent, Forrest,
> >
> > Thanks for the patch & review.
> >
> > Could you summarize and/or expand a bit :-) ?
> 
> The changes I made affect two uses of the concept of Character encodings:
> 
>   1 what's being sent to the browser (ie in JspParseEventListener) as
>     HTTP headers (as literals)
>   2 what's being used to set the CharacterEncoding of input and output
>     streams
> 
> The reason I made the patch is that an (older) version of Lynx that I use to
> test apps barfed on the "text/html; charset=8859_1" header. I noticed this
> was non-standard, and that it's all over the tree; hence the patch.
> It's just a standards thing. ("iso-8859-1" is the 'preferred mime name' for
> this charset; see the IANA charset list that I pointed to). That's category
> 1.
> 
> Forrest then pointed out that the code I touched affects the selection of
> encodings in Java, and that there is a performance gain to be had.
> 
> I did a little investigation into Forrest's remarks, and it turns out that
> _consistently_ using something other than what Java looks at as name of the
> encoding of a string can have an enormous impact (on my benchmark); using
> the canonical name ("ISO8859_1") instead of some alias ("ISO-8859-1" or
> "8859_1") can cause a performance win of up to 20x (!). If one looks up the
> canonical name of the charset before accessing a String with a non-default
> encoding, the total cost is only 1.5x the cost of accessing it with the
> encoding's canonical name.
> 
> I've looked at the 3.x tree, and from superficial tests, it looks like this
> specific code is hardly ever reached by tomcat, so optimising it may not, in
> fact, do any good for anyone using iso-8859-1 for most content &
> user-agents. Most of the work is already done, so I'll do it anyway.
> 
> That's category 2. (patch coming up).
> 
> > Also, has anyone played with the various browsers ? Is any browser
> > sending the charset encoding ? What format ?
> 
> I've been playing with this, but I don't have any definite results. As part
> of the work for issue 2 above, I'll be testing this.
> 
> There isn't actually any reference to charsets used in the request in
> Servlet 2.2; but there is in 2.3 (SRV.4.9 Request data encoding). (They say
> there that not many browsers currently send Content-Encoding with the
> request.)
> 
> > I know that some browsers are encoding the URL with the same charset that
> > is used in the page, while some are using UTF ( there was discussion about
> > that somewhere).
> 
> If you have a reference to this, I'll be happy to look into it.
> 
> > Is it true that browsers that are using UTF ( like IE on NT ? ) do send
> > the body as UTF ? Do they set the Charset-Encoding header ?
> >
> > I would really appreciate some info ( I don't use Windows, and I heard
> > there are differences between IE/Win9x and IE/NT )
> 
> I have no data on this yet, but I will soon.
> 
> Hope this helps,
> 
> Vince.

-- 
Forrest Girouard @ Openwave Systems Inc.
phone: +1-650-817-1556
mailto:Forrest.Girouard@openwave.com
http://www.openwave.com


