tomcat-dev mailing list archives

From Vincent Schonau <vince-jaka...@netnautics.com>
Subject Re: [PATCH] '8859_1' is not a valid charset alias
Date Sun, 20 May 2001 07:02:59 GMT
On Sat, May 19, 2001 at 03:19:09PM -0700, cmanolache@yahoo.com wrote:
> Vicent, Forrest,
> 
> Thanks for the patch & review. 
> 
> Could you summarize and/or expand a bit :-) ? 

The changes I made affect two uses of character encodings:

  1 what's being sent to the browser as literals in HTTP headers (i.e. in
    JspParseEventListener)
  2 what's being used to set the character encoding of input and output
    streams

The reason I made the patch is that an (older) version of Lynx that I use to
test apps barfed on the "text/html; charset=8859_1" header. I noticed that
this name is non-standard, and that it's all over the tree; hence the patch.
It's just a standards thing ("iso-8859-1" is the 'preferred MIME name' for
this charset; see the IANA charset list that I pointed to). That's category
1.
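
To illustrate category 1, here is a minimal sketch of building the header
value from whatever encoding name the Java side happens to use. It assumes
java.nio.charset.Charset (available from Java 1.4, so newer than the tree
discussed here), whose name() returns the charset's canonical name, which for
this charset is the IANA preferred name. The class and method names are
hypothetical, not anything in Tomcat.

```java
import java.nio.charset.Charset;

public class ContentTypeHeader {
    // Hypothetical helper: map any alias Java accepts (e.g. "8859_1")
    // to the canonical charset name before writing it into the header.
    static String contentType(String mediaType, String javaEncodingName) {
        String mimeName = Charset.forName(javaEncodingName).name();
        return mediaType + "; charset=" + mimeName;
    }

    public static void main(String[] args) {
        // prints "text/html; charset=ISO-8859-1"
        System.out.println(contentType("text/html", "8859_1"));
    }
}
```

Charset names are case-insensitive, so "ISO-8859-1" and "iso-8859-1" are
equivalent in the header.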

Forrest then pointed out that the code I touched affects the selection of
encodings in Java, and that there is a performance gain to be had.

I did a little investigation into Forrest's remarks, and it turns out that
_consistently_ using a name other than the one Java treats as the canonical
name of an encoding can have an enormous impact (on my benchmark): using
the canonical name ("ISO8859_1") instead of an alias ("ISO-8859-1" or
"8859_1") can give a performance win of up to 20x (!). If one instead looks
up the canonical name of the charset before accessing a String with a
non-default encoding, the total cost is only 1.5x the cost of accessing it
with the encoding's canonical name directly.
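
A rough sketch of the "look up the canonical name first" idea follows. It
assumes that InputStreamReader.getEncoding() reports the name Java itself
prefers for the encoding (the historical name where one exists), and it adds
a small cache so each alias is resolved only once. The class, the cache, and
the method names are hypothetical; this is not the patch, and I make no claim
about how large the gain is on any particular JVM.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class EncodingNames {
    // alias -> name reported by the JDK, filled in on first use
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    // Resolve an alias such as "8859_1" to the name Java reports for it.
    static String canonicalName(String alias) {
        String cached = CACHE.get(alias);
        if (cached != null) {
            return cached;
        }
        try {
            InputStreamReader probe = new InputStreamReader(
                    new ByteArrayInputStream(new byte[0]), alias);
            String name = probe.getEncoding();
            CACHE.put(alias, name);
            return name;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + alias, e);
        }
    }

    // Encode a String using the resolved name rather than the raw alias.
    static byte[] encode(String s, String alias) {
        try {
            return s.getBytes(canonicalName(alias));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + alias, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("8859_1"));
        System.out.println(encode("caf\u00e9", "ISO-8859-1").length);
    }
}
```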

I've looked at the 3.x tree, and from superficial tests, it looks like this
specific code is hardly ever reached by tomcat, so optimising it may not, in
fact, do any good for anyone using iso-8859-1 for most content &
user-agents. Most of the work is already done, so I'll do it anyway.

That's category 2. (patch coming up).

> Also, does anyone played with the various browsers ? Is any browser
> sending the charset encoding ? What format ? 

I've been playing with this, but I don't have any definite results. As part
of the work for issue 2 above, I'll be testing this.

There isn't actually any reference to charsets used in the request in
Servlet 2.2; but there is in 2.3 (SRV.4.9 Request data encoding). (They say
there that not many browsers currently send a charset qualifier on the
Content-Type header of the request.)
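
For reference, a minimal sketch of reading the charset from a request's
Content-Type header value, falling back to ISO-8859-1 when the client names
none (which is the default as I read SRV.4.9). It deliberately avoids the
servlet API so it stands alone; the class and method names are hypothetical.

```java
public class RequestCharset {
    // Assumed default when the request names no encoding
    static final String DEFAULT_ENCODING = "ISO-8859-1";

    // Extract the charset parameter from a Content-Type header value,
    // e.g. "text/html; charset=UTF-8" yields "UTF-8".
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT_ENCODING;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters follow
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) {
                    return value;
                }
            }
        }
        return DEFAULT_ENCODING;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
        System.out.println(charsetOf("text/html; charset=UTF-8"));
    }
}
```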

> I know that some browsers are encoding the URL with the same charset that
> is used in the page, while some are using UTF ( there was discussion about
> that somewhere). 

If you have a reference to this, I'll be happy to look into it.

> Is it true that browsers that are using UTF ( like IE on NT ? ) do send
> the body as UTF ? Do they set the Charset-Encoding header ?
> 
> I would really apreciate some info ( I don't use Windows, and I heard
> there are differences between IE/Win9x and IE/NT )

I have no data on this yet, but I will soon.


Hope this helps,


Vince.
