lucene-java-user mailing list archives

From "Uwe Schindler" <>
Subject RE: Dubious stuff spotted in LowerCaseFilter
Date Thu, 22 Oct 2015 11:58:48 GMT

> >> Setting aside the fact that Character.toLowerCase is already dubious
> >> in some locales (e.g. Turkish),
> >
> > This is not true. Character.toLowerCase() works locale-independent.
> > It is only String.toLowerCase that works using default locale.
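
For readers following along, the distinction quoted above can be checked with the JDK alone. This is a minimal sketch, not anything from Lucene: the class name LowerCaseLocaleDemo and its helper methods are made up for illustration. It compares Character.toLowerCase, which takes no locale, with String.toLowerCase given an explicit locale.

```java
import java.util.Locale;

// Hypothetical demo class; only JDK APIs are used.
public class LowerCaseLocaleDemo {

    // Locale-independent per-character lowercasing (what Character.toLowerCase does).
    static char lowerChar(char c) {
        return Character.toLowerCase(c);
    }

    // Locale-neutral String lowercasing.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Turkish-aware String lowercasing: 'I' maps to dotless 'ı' (U+0131).
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        System.out.println(lowerChar('I'));        // same result under any default locale
        System.out.println(lowerRoot("TITLE"));
        System.out.println(lowerTurkish("TITLE")); // contains U+0131 instead of 'i'
    }
}
```

String.toLowerCase() without an argument behaves like lowerTurkish when the JVM default locale is Turkish, which is why passing the locale explicitly is the safer habit.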

So you mean the opposite: you wanted it to be locale-dependent. That's already possible.
LowerCaseFilter is documented to use only default Unicode folding, nothing locale-specific.
If you have a Turkish Lucene field, you need to do locale-specific analysis anyway (e.g.
use TurkishAnalyzer), which uses TurkishLowerCaseFilter. Having both variants as synonyms
needs more work, but that is out of the scope of this mail thread.

> Yet if you have a field like "title" and the user and system are Turkish, the
> user would expect their locale to apply, yet LowerCaseFilter will not handle
> that. So whereas it is "safe" for English hard-coded strings, it isn't safe for all
> fields you might index in general.

That's documented like that!

> Dawid's response shows, though, that at least for the time being, there is
> nothing to worry about. Hopefully Unicode will never add a code point which
> lowercases to one with less code units (or I guess changes one of the lower
> ones to lowercase to more than one...)
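
The worry quoted above (a code point whose lowercase form needs a different number of UTF-16 code units) can be tested by brute force against whatever Unicode version the running JDK ships. This is only a sketch under that assumption, and not necessarily the check Dawid ran; the class name LowerCaseLengthCheck and its methods are hypothetical.

```java
// Hypothetical check class; only JDK APIs are used.
public class LowerCaseLengthCheck {

    // True if lowercasing cp keeps the same number of UTF-16 code units.
    static boolean sameLength(int cp) {
        return Character.charCount(cp) == Character.charCount(Character.toLowerCase(cp));
    }

    // Counts code points whose simple lowercase mapping changes the UTF-16 length.
    static int countLengthChanging() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (!sameLength(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The result depends on the Unicode tables of the JDK in use.
        System.out.println("Length-changing lowercase mappings: " + countLengthChanging());
    }
}
```

Note that this only covers the one-to-one mappings exposed by Character.toLowerCase(int); the multi-character special casings applied by String.toLowerCase are a separate matter.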

There was already a discussion about that in JIRA at the time LowerCaseFilter was rewritten
to support supplementary characters outside the BMP. I have to look up the issue, but I am quite
sure that the Unicode Policeman did a lot of research and found a statement in the Unicode spec
that the upper- and lowercase letters are always in the same block. I will try to look this up.

