lucene-java-user mailing list archives

From Ahmet Arslan <>
Subject Re: Dubious stuff spotted in LowerCaseFilter
Date Thu, 22 Oct 2015 12:27:08 GMT
Hi Uwe,

What is the meaning of "the Unicode Policeman"?


On Thursday, October 22, 2015 2:59 PM, Uwe Schindler <> wrote:


> >> Setting aside the fact that Character.toLowerCase is already dubious
> >> in some locales (e.g. Turkish),
> >
> > > This is not true. Character.toLowerCase() is locale-independent.
> > > Only String.toLowerCase() uses the default locale.

So you mean the opposite: you want it to be locale-dependent. That’s already possible.
LowerCaseFilter is documented to use only default Unicode folding, nothing locale-specific.
If you have a Turkish Lucene field, you need to do locale-specific analysis anyway (e.g.
use TurkishAnalyzer), which uses TurkishLowerCaseFilter. Having both variants as synonyms
needs more work, but that is out of scope for this mail thread.
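To make the difference concrete, here is a minimal JDK-only sketch. It does not use Lucene at all: the class and method names are made up for illustration, and the per-code-point loop only approximates what a locale-independent lowercasing filter would do, it is not the filter's actual code.

```java
import java.util.Locale;

// Hypothetical demo class, not part of Lucene.
public class TurkishLowercaseDemo {

    // Locale-independent: lowercases each code point with Character.toLowerCase(int).
    static String lowerPerCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .map(Character::toLowerCase)
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // Locale-dependent: applies the Turkish rule, where 'I' maps to dotless U+0131.
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        String title = "TITLE";
        System.out.println(lowerPerCodePoint(title)); // locale-independent result
        System.out.println(lowerTurkish(title));      // Turkish result, contains U+0131
    }
}
```

The two results differ only in the letter that came from 'I', which is why a Turkish field indexed one way and queried the other way would not match.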

> Yet if you have a field like "title" and the user and system are Turkish, the
> user would expect their locale to apply, yet LowerCaseFilter will not handle
> that. So whereas it is "safe" for English hard-coded strings, it isn't safe for all
> fields you might index in general.

That's documented like that!

> Dawid's response shows, though, that at least for the time being, there is
> nothing to worry about. Hopefully Unicode will never add a code point which
> lowercases to one with fewer code units (or I guess changes one of the lower
> ones to lowercase to more than one...)

There was a discussion about that in JIRA already at the time LowerCaseFilter was rewritten
to allow supplementary characters outside the BMP. I have to look up the issue, but I am quite sure that
the Unicode Policeman did a lot of research and found some statement in the Unicode spec that
the upper- and lowercase letters are always in the same block. I will try to look this up.
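For anyone who wants to check this on their own JDK, a rough sketch follows. It scans every code point and counts those whose Character.toLowerCase(int) result needs a different number of UTF-16 code units. The class and method names are invented for this mail, and the result depends on the Unicode version your JDK ships, so I give no expected count here.

```java
// Hypothetical check, not taken from Lucene.
public class CaseMappingLengthCheck {

    // True if lowercasing cp keeps the same number of UTF-16 code units.
    static boolean preservesLength(int cp) {
        return Character.charCount(cp) == Character.charCount(Character.toLowerCase(cp));
    }

    // Counts code points whose lowercase form has a different code unit count.
    static int countLengthChanges() {
        int changes = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (!preservesLength(cp)) {
                changes++;
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        // U+10400 (Deseret capital long I) is a supplementary character with a lowercase form.
        int deseret = 0x10400;
        System.out.printf("U+%X -> U+%X%n", deseret, Character.toLowerCase(deseret));
        System.out.println("code points changing length: " + countLengthChanges());
    }
}
```

If the count is ever non-zero, an in-place lowercasing of a char[] buffer could no longer assume the lowercased term has the same length as the original.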


To unsubscribe, e-mail:
For additional commands, e-mail:
