lucene-java-user mailing list archives

From Trejkaz <trej...@trypticon.org>
Subject Dubious stuff spotted in LowerCaseFilter
Date Thu, 22 Oct 2015 05:14:59 GMT
Hi all.

LowerCaseFilter uses CharacterUtils.toLowerCase to perform its work.
The latter method looks like this:

public final void toLowerCase(final char[] buffer, final int offset, final int limit) {
  assert buffer.length >= limit;
  assert offset <=0 && offset <= buffer.length;
  for (int i = offset; i < limit;) {
    i += Character.toChars(
            Character.toLowerCase(
                codePointAt(buffer, i, limit)), buffer, i);
  }
}

Setting aside the fact that Character.toLowerCase is locale-insensitive
and therefore already dubious for some languages (e.g. Turkish), I
notice that this uses the same "i" index both as the source offset and
as the destination offset. So basically, this code has an undocumented
assumption that Character.toLowerCase always returns a code point
which takes up the same number of chars (UTF-16 code units) as the
original one.
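
To make the assumption concrete, here is a minimal sketch of what an explicit check might look like. It is not the Lucene code: the class name CheckedLowerCase is hypothetical, and the JDK's static Character.codePointAt(char[], int, int) stands in for the codePointAt method used by CharacterUtils.

```java
public class CheckedLowerCase {

    /**
     * Lower-cases buffer[offset, limit) in place, but fails loudly if a
     * lower-case mapping would occupy a different number of chars than the
     * original code point (the case the shared index silently relies on).
     */
    static void toLowerCase(char[] buffer, int offset, int limit) {
        for (int i = offset; i < limit; ) {
            int cp = Character.codePointAt(buffer, i, limit);
            int lower = Character.toLowerCase(cp);
            int width = Character.charCount(cp);
            if (Character.charCount(lower) != width) {
                throw new IllegalStateException(String.format(
                    "U+%04X lower-cases to U+%04X, which has a different width", cp, lower));
            }
            // Safe to overwrite in place: source and destination widths match.
            Character.toChars(lower, buffer, i);
            i += width;
        }
    }

    public static void main(String[] args) {
        char[] text = "Hello WORLD".toCharArray();
        toLowerCase(text, 0, text.length);
        System.out.println(new String(text));
    }
}
```

Advancing by the width of the source code point rather than by the return value of Character.toChars keeps the read position correct even if the check were ever removed.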

While I suppose this is probably the case, did anyone actually
verify it? Say, by iterating over all code points or something? How
confident are we that this will continue to be the case forever? :)
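
The kind of exhaustive check I have in mind could be sketched as follows. The class and method names are made up for illustration, and the result only speaks for the Unicode tables shipped with whichever JDK runs it, so it says nothing about future Unicode versions.

```java
import java.util.ArrayList;
import java.util.List;

public class LowerCaseWidthCheck {

    /** True if lower-casing cp yields a code point with the same UTF-16 width. */
    static boolean sameWidth(int cp) {
        return Character.charCount(cp) == Character.charCount(Character.toLowerCase(cp));
    }

    /** Collects every code point whose lower-case form has a different UTF-16 width. */
    static List<Integer> findMismatches() {
        List<Integer> mismatches = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (!sameWidth(cp)) {
                mismatches.add(cp);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        List<Integer> mismatches = findMismatches();
        for (int cp : mismatches) {
            System.out.printf("U+%04X -> U+%04X%n", cp, Character.toLowerCase(cp));
        }
        System.out.println(mismatches.size() + " code point(s) change width when lower-cased");
    }
}
```

Run as a unit test against each supported JDK, something like this would at least turn the assumption into a documented, checked one.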

TX
