tomcat-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 51400] Use of "new String(byte[] b, String enc)" hits Sun JVM bottleneck
Date Thu, 23 Jun 2011 20:02:11 GMT
https://issues.apache.org/bugzilla/show_bug.cgi?id=51400

--- Comment #12 from Christopher Schultz <chris@christopherschultz.net> 2011-06-23 20:02:11 UTC ---
> > I suppose it's a fairly small set of encodings, but with little benefit,
> > there's no reason IMO to pre-populate.
>
> You're right; however, if I read the reports correctly, this is true only if
> charsets with valid names are used for the lookup. But every time there is a
> lookup for a non-existing Charset, the JVM-synchronized Charset.lookup() is
> called. Probably to speed this up, Konstantin Kolinko suggested caching
> charset misses.

Duh. I hadn't thought of spurious lookups causing their own synchronization
disasters.

Perhaps the invalid-charset cache could be limited in some way: MRU caches are
easy to build with the standard Java library.
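
A minimal sketch of such a bounded cache, assuming a LinkedHashMap in access
order with removeEldestEntry() overridden. The class name BoundedMissCache and
its methods are hypothetical and not part of Tomcat. It evicts the least
recently used name, so the most recently used ones are kept:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not Tomcat code): remembers a bounded number of
// charset names that are already known to be unsupported.
final class BoundedMissCache {
    private final Map<String, Boolean> misses;

    BoundedMissCache(final int maxEntries) {
        // accessOrder=true keeps entries ordered from least to most recently
        // accessed, so the "eldest" entry is the least recently used one.
        this.misses = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            private static final long serialVersionUID = 1L;

            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                // Called after each insertion; returning true drops the eldest entry.
                return size() > maxEntries;
            }
        };
    }

    // In access order even get() reorders the map, so every method locks.
    synchronized void recordMiss(String name) {
        misses.put(name, Boolean.TRUE);
    }

    synchronized boolean isKnownMiss(String name) {
        return misses.get(name) != null;
    }

    synchronized int size() {
        return misses.size();
    }
}
```

Note that the synchronized methods reintroduce a lock of their own, although it
is held only for a single map operation.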

> If a list with all available charsets were pre-populated, including their
> aliases, missing charsets could also be determined fast.

True: if the encoding is not supported by the JVM, then it's invalid no matter
what. In that case, case normalization is probably a good thing to do: if it's
not in the cache (after normalization), then it's not valid... no reason to ever
call Charset.lookup() after startup.
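
A rough sketch of that idea, assuming the table is filled once from
Charset.availableCharsets() and keyed by lower-cased name. CharsetTable,
normalize() and find() are made-up names for illustration and are not the
B2CConverter code:

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical lookup table (not Tomcat code), filled once at class
// initialization with every canonical charset name and alias the JVM reports.
final class CharsetTable {
    private static final Map<String, Charset> BY_NAME = new HashMap<String, Charset>();

    static {
        for (Charset cs : Charset.availableCharsets().values()) {
            BY_NAME.put(normalize(cs.name()), cs);
            for (String alias : cs.aliases()) {
                BY_NAME.put(normalize(alias), cs);
            }
        }
    }

    private CharsetTable() {
    }

    // A fixed locale avoids surprises such as the Turkish dotless i.
    static String normalize(String name) {
        return name.toLowerCase(Locale.ENGLISH);
    }

    // Returns null for unknown names without ever calling Charset.forName().
    static Charset find(String name) {
        return name == null ? null : BY_NAME.get(normalize(name));
    }
}
```

The map is only read after class initialization, so lookups need no locking.
One caveat of this sketch: a charset provider that becomes available after
startup would not be seen by the table.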

> Well, on my Windows machine the longest alias (not canonical name) of a charset
> is "Extended_UNIX_Code_Packed_Format_for_Japanese" which consists of 39 mutable
> characters.

Wow.

> The current (trunk) implementation in
> o.a.tomcat.util.buf.B2CConverter.getCharset() does not normalize the name, so a
> Client could send requests with 2^39 permutations in a Content-Type header
> (which would make 49 TiB of Charset strings) ;-)

My math might be wrong, too, but I believe 2^39 is 512Gi names rather than
bytes. At 45 chars per name that's about 22.5TiB if names are
1-byte-per-char, but I think Java does 2-bytes-per-char, so it's about 45TiB.

You're right, though: that's pretty huge.
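
A rough way to check these figures, assuming each of the 39 letters can
independently be upper or lower case and counting only character data (no
String object overhead). CaseVariantMath is a throwaway helper written for
this calculation:

```java
// Throwaway helper for the back-of-the-envelope numbers in this thread.
final class CaseVariantMath {
    private CaseVariantMath() {
    }

    // Number of characters whose case a client could flip.
    static int letterCount(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                letters++;
            }
        }
        return letters;
    }

    // Each letter doubles the number of distinct spellings: 2^letters.
    // Only valid for fewer than 63 letters, since the result is a long.
    static long variants(String s) {
        return 1L << letterCount(s);
    }

    // Total character data for all spellings, in TiB (2^40 bytes).
    static double tebibytes(String s, int bytesPerChar) {
        return (double) variants(s) * s.length() * bytesPerChar / (double) (1L << 40);
    }
}
```

For "Extended_UNIX_Code_Packed_Format_for_Japanese" this gives 39 letters in a
45-character name, 2^39 = 549,755,813,888 spellings, and 22.5 TiB at one byte
per char or 45 TiB at two bytes per char.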

+1 to case normalization.
+1 to LUT pre-population.
-1 to LUT miss caching: it's totally unnecessary given the above.


