tomcat-dev mailing list archives

Subject DO NOT REPLY [Bug 51400] Use of "new String(byte[] b, String enc)" hits Sun JVM bottleneck
Date Thu, 23 Jun 2011 15:25:01 GMT

--- Comment #11 from Konstantin Preißer <> 2011-06-23 15:25:01 UTC ---
Hi Christopher,

(In reply to comment #10)
> If you read some of the online posts linked from this BZ issue, you'll see
> claims that pre-populating such a cache does not have a noticeable impact on
> performance. Honestly, I'm okay not pre-populating things because there are
> probably a dozen encodings that get any significant amount of real use on the
> web, while Charset.availableCharsets returns 163 different obscure character
> sets.
> I suppose it's a fairly small set of encodings, but with little benefit,
> there's no reason IMO to pre-populate.
You're right; however, if I read the reports correctly, that holds only when
the lookups use valid charset names. Every time there is a lookup for a
non-existent Charset, the JVM-synchronized Charset.lookup() is called.
Probably to speed this up, Konstantin Kolinko suggested caching charset
misses as well.

If a list of all available charsets, including their aliases, were
pre-populated, missing charsets could also be detected quickly.
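
To illustrate, here is a minimal sketch of such a pre-populated table. The
class name CharsetCache and its get() method are made up for this example and
are not Tomcat's actual code; the keys are lower-cased on the assumption that
lookups should be case-insensitive.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not the real B2CConverter implementation.
final class CharsetCache {

    // Filled once during class initialization and only read afterwards,
    // so lookups need no synchronization.
    private static final Map<String, Charset> CACHE = new HashMap<>();

    static {
        for (Charset cs : Charset.availableCharsets().values()) {
            // Register the canonical name ...
            CACHE.put(cs.name().toLowerCase(Locale.ENGLISH), cs);
            // ... and every alias, all under lower-case keys.
            for (String alias : cs.aliases()) {
                CACHE.put(alias.toLowerCase(Locale.ENGLISH), cs);
            }
        }
    }

    private CharsetCache() {
    }

    // Returns the Charset for the given name or alias, or null if the
    // name is unknown. A miss never reaches Charset.forName().
    static Charset get(String name) {
        if (name == null) {
            return null;
        }
        return CACHE.get(name.toLowerCase(Locale.ENGLISH));
    }
}
```

With this, CharsetCache.get("uTf-8") and CharsetCache.get("UTF-8") return the
same Charset, and an unknown name simply returns null.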

> Actually, I might leave the case in-tact for performance considerations. Yes,
> it's true that utf-8, UTF-8, uTf-8, UTf-8, UtF-8, etc. would all be the same, I
> suspect that only "utf-8" and "UTF-8" will be used in the wild with any
> reasonable frequency. Normalizing case for every lookup is probably a waste of
> time, unless there are significant concerns of DOS using long, non-normalized
> permutations of valid encodings (longest is x-MacCentralEurope with 17
> characters to play with). 17 characters is a lot of permutations (~2MiB),
> though.
Well, on my Windows machine the longest alias (not canonical name) of a charset
is "Extended_UNIX_Code_Packed_Format_for_Japanese", which has 39 mutable
characters (letters whose case can vary). The current (trunk) implementation in
o.a.tomcat.util.buf.B2CConverter.getCharset() does not normalize the name, so a
client could send requests with 2^39 case permutations in a Content-Type header
(which would make 49 TiB of Charset strings) ;-)
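
The counting above can be reproduced with a small sketch. The class
CaseVariants and its method names are invented for this example; it assumes
that only ASCII letters can change case and that lower-casing with a fixed
locale is an acceptable normalization for a cache key.

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
final class CaseVariants {

    private CaseVariants() {
    }

    // Counts the ASCII letters in the name, i.e. the characters that can
    // appear in either upper or lower case.
    static int mutableChars(String name) {
        int count = 0;
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                count++;
            }
        }
        return count;
    }

    // Number of distinct upper/lower-case spellings: 2^(mutable characters).
    // Only meaningful while the letter count stays below 63.
    static long variants(String name) {
        return 1L << mutableChars(name);
    }

    // Normalizing before the lookup collapses all spellings into one key.
    static String cacheKey(String name) {
        return name.toLowerCase(Locale.ENGLISH);
    }
}
```

For "Extended_UNIX_Code_Packed_Format_for_Japanese", mutableChars() gives 39
and variants() gives 2^39, while cacheKey() maps every one of those spellings
to the same string.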
