lucene-java-user mailing list archives

From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: Dubious stuff spotted in LowerCaseFilter
Date Thu, 22 Oct 2015 08:15:41 GMT
I think the issue here is what happens if an "uppercase" codepoint requires
a surrogate pair and the lowercase counterpart does not -- then the index
variable would indeed be screwed.
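
To make that failure mode concrete, here is a minimal sketch. It reuses the loop shape of the quoted method but takes the mapping as a parameter. The class name, the `mapInPlace` and `shrinkExample` helpers, and above all the "shrinking" mapping are made up for illustration. In real Unicode data U+10400 lowercases to U+10428, which also needs a surrogate pair.

```java
import java.util.function.IntUnaryOperator;

public class SharedIndexDemo {

    // Same loop shape as the quoted toLowerCase, but the mapping is a parameter
    // so a hypothetical width-changing mapping can be plugged in.
    static void mapInPlace(char[] buffer, int offset, int limit, IntUnaryOperator mapping) {
        for (int i = offset; i < limit;) {
            int codePoint = Character.codePointAt(buffer, i, limit);
            // Writes at i and advances by the width of the *mapped* code point.
            i += Character.toChars(mapping.applyAsInt(codePoint), buffer, i);
        }
    }

    // Runs the loop over U+10400 followed by 'X' with a made-up shrinking mapping.
    static char[] shrinkExample() {
        char[] buffer = new StringBuilder().appendCodePoint(0x10400).append('X')
                .toString().toCharArray();
        // HYPOTHETICAL mapping, not a real Unicode rule: any supplementary
        // code point becomes 'a', everything else is left alone.
        IntUnaryOperator shrinking = cp -> Character.isSupplementaryCodePoint(cp) ? 'a' : cp;
        mapInPlace(buffer, 0, buffer.length, shrinking);
        return buffer;
    }

    public static void main(String[] args) {
        for (char c : shrinkExample()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

With that made-up mapping the buffer ends up as 'a', a stale low surrogate U+DC00, and 'X': the second half of the original pair is never overwritten, and on the next iteration it is read as a code point of its own.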

Dawid

On Thu, Oct 22, 2015 at 10:05 AM, Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi,
>
> > Setting aside the fact that Character.toLowerCase is already dubious in
> > some locales (e.g. Turkish),
>
> This is not true. Character.toLowerCase() is locale-independent. Only the
> no-argument String.toLowerCase() uses the default locale.
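
A small sketch of the difference, for anyone who wants to see it rather than take it on faith. The class and helper names are made up. The only JDK calls used are Character.toLowerCase(int), String.toLowerCase(Locale) and Locale.forLanguageTag.

```java
import java.util.Locale;

public class LocaleLowerCaseDemo {

    // Character.toLowerCase takes no locale; it uses the simple Unicode mapping.
    static int lowerViaCharacter(int codePoint) {
        return Character.toLowerCase(codePoint);
    }

    // String.toLowerCase(Locale) applies locale-sensitive rules, here Turkish.
    static String lowerViaTurkishString(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        // Print code points in hex to avoid console encoding surprises.
        System.out.println(Integer.toHexString(lowerViaCharacter('I')));
        System.out.println(Integer.toHexString("I".toLowerCase(Locale.ROOT).codePointAt(0)));
        System.out.println(Integer.toHexString(lowerViaTurkishString("I").codePointAt(0)));
    }
}
```

The first two lines print 69 (plain 'i'); the third prints 131 (dotless ı, U+0131), which is the Turkish behaviour the original post was worried about. It shows up only on the String side.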
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: Trejkaz [mailto:trejkaz@trypticon.org]
> > Sent: Thursday, October 22, 2015 7:15 AM
> > To: Lucene Users Mailing List
> > Subject: Dubious stuff spotted in LowerCaseFilter
> >
> > Hi all.
> >
> > LowerCaseFilter uses CharacterUtils.toLowerCase to perform its work.
> > The latter method looks like this:
> >
> > public final void toLowerCase(final char[] buffer, final int offset,
> >                               final int limit) {
> >   assert buffer.length >= limit;
> >   assert offset <=0 && offset <= buffer.length;
> >   for (int i = offset; i < limit;) {
> >     i += Character.toChars(
> >             Character.toLowerCase(
> >                 codePointAt(buffer, i, limit)), buffer, i);
> >    }
> > }
> >
> > Setting aside the fact that Character.toLowerCase is already dubious in
> > some locales (e.g. Turkish), I notice that this is using the same "i"
> > index counter to refer to both the source offset and the destination
> > offset. So basically, this code has an undocumented assumption that
> > Character.toLowerCase always returns a code point which takes up the
> > same number of characters as the original one.
> >
> > Whereas I do suppose that this might be the case, did someone actually
> > verify it? Say, by iterating all code points or something? How confident
> > are we that this will continue to be the case forever? :)
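
The brute-force check suggested here is cheap to write. A rough sketch follows; the class and method names are made up, and whatever it reports only holds for the Unicode tables shipped with the JDK it runs on, so it says nothing about future Unicode versions.

```java
public class LowerCaseWidthCheck {

    // True if lowercasing changes the number of chars needed for this code point.
    static boolean changesWidth(int codePoint) {
        int lower = Character.toLowerCase(codePoint);
        return Character.charCount(codePoint) != Character.charCount(lower);
    }

    // Scans every code point and reports the ones whose width changes.
    static int countWidthChanges() {
        int mismatches = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (changesWidth(cp)) {
                mismatches++;
                System.out.printf("U+%04X -> U+%04X%n", cp, Character.toLowerCase(cp));
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("width-changing mappings: " + countWidthChanges());
    }
}
```

Run as a unit test against each supported JDK, a non-zero count would flag the moment the shared-index assumption stops holding.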
> >
> > TX
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
>
