lucene-java-user mailing list archives

From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: Dubious stuff spotted in LowerCaseFilter
Date Thu, 22 Oct 2015 08:24:34 GMT
Well, practice says there are no such cases...

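        // Compare the UTF-16 length of every code point with that of its
        // upper- and lowercase mappings; the bound is inclusive so that
        // Character.MAX_CODE_POINT itself is checked too.
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            int c1 = Character.charCount(cp);
            int c2 = Character.charCount(Character.toUpperCase(cp));
            int c3 = Character.charCount(Character.toLowerCase(cp));
            if (c1 != c2 ||
                c1 != c3) {
                System.out.println(String.format(Locale.ROOT,
                    "%d %d %d",
                    c1, c2, c3));
            }
        }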

D.
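
For readers who would rather not depend on that property at all, the concern raised in the thread (one index serving as both read and write position) can be sidestepped by tracking the two positions separately. The following is only a minimal sketch under stated assumptions: the class `SafeLowerCase` and its method are hypothetical names, not part of Lucene, and unlike `CharacterUtils.toLowerCase` it returns a new array instead of lowercasing in place.

```java
import java.util.Arrays;

public class SafeLowerCase {
    // Hypothetical helper, NOT Lucene code: lowercases buffer[offset, limit)
    // into a fresh array, using independent read and write positions.
    public static char[] toLowerCase(char[] buffer, int offset, int limit) {
        // A code point occupies at most two chars, so twice the input
        // length is always enough room for the output.
        char[] out = new char[(limit - offset) * 2];
        int write = 0;
        for (int read = offset; read < limit; ) {
            int cp = Character.codePointAt(buffer, read, limit);
            // Advance the read position by the width of the SOURCE code point...
            read += Character.charCount(cp);
            // ...and the write position by the width of the RESULT code point.
            write += Character.toChars(Character.toLowerCase(cp), out, write);
        }
        return Arrays.copyOf(out, write);
    }

    public static void main(String[] args) {
        // 'A', 'B', U+10400 (a supplementary uppercase letter), 'c'
        char[] in = "AB\uD801\uDC00c".toCharArray();
        System.out.println(new String(toLowerCase(in, 0, in.length)));
    }
}
```

The price is an extra allocation and copy, which is presumably why an in-place loop is attractive as long as the check above keeps printing nothing.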

On Thu, Oct 22, 2015 at 10:15 AM, Dawid Weiss <dawid.weiss@gmail.com> wrote:

>
> I think the issue here is what happens if an "uppercase" codepoint
> requires a surrogate pair and the lowercase counterpart does not -- then
> the index variable would indeed be screwed.
>
> Dawid
>
> On Thu, Oct 22, 2015 at 10:05 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
>
>> Hi,
>>
>> > Setting aside the fact that Character.toLowerCase is already dubious
>> > in some locales (e.g. Turkish),
>>
>> This is not true. Character.toLowerCase() is locale-independent. It is
>> only String.toLowerCase() that uses the default locale.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>>
>> > -----Original Message-----
>> > From: Trejkaz [mailto:trejkaz@trypticon.org]
>> > Sent: Thursday, October 22, 2015 7:15 AM
>> > To: Lucene Users Mailing List
>> > Subject: Dubious stuff spotted in LowerCaseFilter
>> >
>> > Hi all.
>> >
>> > LowerCaseFilter uses CharacterUtils.toLowerCase to perform its work.
>> > The latter method looks like this:
>> >
>> > public final void toLowerCase(final char[] buffer, final int offset,
>> >     final int limit)
>> > {
>> >   assert buffer.length >= limit;
>> >   assert 0 <= offset && offset <= buffer.length;
>> >   for (int i = offset; i < limit;) {
>> >     i += Character.toChars(
>> >             Character.toLowerCase(
>> >                 codePointAt(buffer, i, limit)), buffer, i);
>> >    }
>> > }
>> >
>> > Setting aside the fact that Character.toLowerCase is already dubious in
>> > some locales (e.g. Turkish), I notice that this is using the same "i"
>> > index counter to refer to both the source offset and the destination
>> > offset. So basically, this code has an undocumented assumption that
>> > Character.toLowerCase always returns a code point which takes up the same
>> > number of characters as the original one.
>> >
>> > Whereas I do suppose that this might be the case, did someone actually
>> > verify it? Say, by iterating all code points or something? How confident
>> > are we that this will continue to be the case forever? :)
>> >
>> > TX
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>
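
The distinction Uwe draws in the quoted thread, between the locale-independent Character.toLowerCase and the locale-sensitive String.toLowerCase, can be seen with the Turkish capital I. This is a small illustrative sketch; the class and method names are made up for the example.

```java
import java.util.Locale;

public class LocaleCaseDemo {
    // Code-point-level lowercasing: takes no locale and consults none.
    public static int charLevel(int codePoint) {
        return Character.toLowerCase(codePoint);
    }

    // String-level lowercasing: the result depends on the locale passed in
    // (or on the default locale when the no-argument overload is used).
    public static String stringLevel(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Print code points in hex so the output does not depend on the
        // console encoding.
        System.out.println(String.format(Locale.ROOT, "Character: U+%04X",
            charLevel('I')));
        System.out.println(String.format(Locale.ROOT, "String (ROOT): U+%04X",
            stringLevel("I", Locale.ROOT).codePointAt(0)));
        System.out.println(String.format(Locale.ROOT, "String (tr-TR): U+%04X",
            stringLevel("I", turkish).codePointAt(0)));
    }
}
```

With the Turkish locale, the String method maps "I" to the dotless ı (U+0131), while the Character method and the ROOT-locale String call both give the ordinary "i" (U+0069).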
