lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Dubious stuff spotted in LowerCaseFilter
Date Thu, 22 Oct 2015 12:57:35 GMT
> What is the meaning of "the Unicode Policeman" ?

Robert Muir :-)

Uwe

> Thanks,
> Ahmet
> 
> On Thursday, October 22, 2015 2:59 PM, Uwe Schindler <uwe@thetaphi.de>
> wrote:
> 
> 
> 
> Hi,
> 
> 
> > >> Setting aside the fact that Character.toLowerCase is already
> > >> dubious in some locales (e.g. Turkish),
> > >
> > > This is not true. Character.toLowerCase() is locale-independent.
> > > Only String.toLowerCase() uses the default locale.
> 
> So you mean the opposite: you wanted it to be locale-dependent. That’s
> already possible: LowerCaseFilter is documented to use only default Unicode
> folding, nothing locale-specific. If you have a Turkish Lucene field, you need to
> do locale-specific analysis anyway (e.g. use TurkishAnalyzer), which uses
> TurkishLowerCaseFilter. Having both variants as synonyms needs more work,
> but that is outside the scope of this mail thread.
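
A minimal JDK-only sketch of the distinction discussed above: Character.toLowerCase() ignores the locale, while String.toLowerCase(Locale) applies Turkish rules when given a Turkish locale. The class and method names (TurkishLowerCaseDemo, perChar, withLocale) are made up for illustration; this is not Lucene code and does not show how TurkishLowerCaseFilter is implemented.

```java
import java.util.Locale;

public class TurkishLowerCaseDemo {
    // Hypothetical constant for this sketch: the Turkish locale.
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Locale-independent: uses the simple Unicode case mapping only.
    static char perChar(char c) {
        return Character.toLowerCase(c);
    }

    // Locale-sensitive: the result depends on the locale passed in.
    static String withLocale(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        // 'I' maps to 'i' regardless of the JVM default locale.
        System.out.println(perChar('I'));
        // Under Turkish rules, 'I' maps to dotless 'ı' (U+0131).
        System.out.println(withLocale("TITLE", TURKISH));
        // With the root locale, 'I' maps to plain 'i'.
        System.out.println(withLocale("TITLE", Locale.ROOT));
    }
}
```

Passing an explicit Locale rather than relying on the default keeps the result the same on every machine, which is usually what you want for indexed text.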
> 
> > Yet if you have a field like "title" and the user and system are
> > Turkish, the user would expect their locale to apply, yet
> > LowerCaseFilter will not handle that. So whereas it is "safe" for
> > English hard-coded strings, it isn't safe for all fields you might index in
> > general.
> 
> That is the documented behavior!
> 
> > Dawid's response shows, though, that at least for the time being,
> > there is nothing to worry about. Hopefully Unicode will never add a
> > code point which lowercases to one with less code units (or I guess
> > changes one of the lower ones to lowercase to more than one...)
> 
> There was already a discussion about that in JIRA when LowerCaseFilter was
> rewritten to support supplementary characters outside the BMP. I have to look up the
> issue, but I am quite sure that the Unicode Policeman did a lot of research
> and found a statement in the Unicode spec that upper- and lowercase
> letters are always in the same block. I will try to look this up.
> 
> 
> Uwe
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
