harmony-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Ellison <t.p.elli...@gmail.com>
Subject [classlib] String.toLowerCase/toUpperCase incorrect for supplementary characters (HARMONY-6649)
Date Thu, 16 Sep 2010 13:05:20 GMT
On 16/Sep/2010 12:32, Robert Muir (JIRA) wrote:
> Does this mean harmony might need these methods for its own internal
> use before ICU is available?

Yes, String is used early in the bootstrapping, and having dependencies
on ICU functionality leads to an initialization circularity.

i.e. if I simply implement String#toUpperCase(Locale) as
"return UCharacter.toUpperCase(locale, this)"
then we fail to boot with

java.nio.charset.Charset (initialization failure)
     at java/lang/String.defaultCharset (String.java:736)
     at java/lang/String.<init> (String.java:232)
     at org/apache/harmony/luni/util/Util.toString (Util.java:102)
     at java/lang/System.getPropertyList (Native Method)
     at java/lang/System.ensureProperties (System.java:546)
     at java/lang/System.<clinit> (System.java:102)

Likely because we use nio Charset in the String implementation, and that
 in turn eventually calls String.toUppercase(), in CharsetProviderImpl
lines 113 and 145.

> When doing the toLowerCase(Locale)/toUpperCase(Locale), perhaps
> String.java could do:
> if (locale is not Turkish or Azeri or Lithuanian) while (ch < 0x7f) (
> just do optimized fast subtraction/addition ) ... // bail out
> completely and invoke 'UCharacter.xxx'

So I wrote a prototype version of this, that does the toUpperCase for
ascii (ch<128) in-lined (i.e. using simple math), then bailing out to
ICU for chars outside that range.

It seems to boot ok with that scheme, however, I'm running on a machine
with UK locale and running our code written in ascii text.  I'll need to
think about whether the boot sequence could ever get to the point where
it needs to uppercase non-ascii.

It may be better to take out the calls of toUpperCase from nio Charset,
and have a local method to deal with those, since there are restrictions
on the strings that can be used to name charsets.

> this might be good for performance reasons?

Yes, though hopefully ICU also optimize the simple cases too.

> And harmony itself, if it uses this method at this point, is likely
> using Locale.ENGLISH or similar for consistent behavior (filenames,
> etc) ?

The locale is set up early, but ascii is ascii no matter the locale, so
uppercasing that simpler set should be fine.

> Sorry I'm not too familiar with the codebase so I'm not sure
> if it would work. But it might speed up 'typical' lowercasing in any
> case, and as far as worst-case 2x for the "special casing": i find
> the "special" casing is going to be slow anyway: e.g. the Greek sigma
> example needs to calculate word boundaries!
> the Turkish/Azeri case is trickier than the existing code, I think it
> should use UCharacter.XXX too. The reason is it has to be able to
> handle the case from SpecialCasing where the 'dotted I' is written in
> decomposed form (e.g. I + COMBINING DOT ABOVE)

Yep, I'm happy to let ICU handle these interesting cases.


View raw message