lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: LowerCaseFilter fails one letter (I) of Turkish alphabet
Date Tue, 01 Dec 2009 18:43:18 GMT
Hi Ahmet,

After thinking about what Shai brought up, I changed my mind: I no longer
think it is good enough to have Collation as the only way to solve this.
You might want Turkish stemming too, and right now there is no way for the
included Snowball Turkish stemmer to work.
I really do not like this.

So as much as I want to reduce clutter and avoid having lots of filters for
problems that can be solved in a general way with Unicode, I think this is
one case where the best solution for now would be a Turkish-specific
LowerCaseFilter...

I don't think we have to use String for this either; we can just apply
special rules to the two uppercase I's (dotless and dotted) and lowercase
everything else as usual.
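
To make that concrete, here is a rough sketch of the per-character rule, not
a patch. The class and method names are hypothetical and this is not wired
into TokenFilter; it only shows the mapping on a char buffer. It also assumes
the dotted capital arrives as the single code point U+0130 and ignores the
decomposed form (I followed by a combining dot above).

```java
// Hypothetical helper, not Lucene's API: lowercases a term buffer in place,
// applying Turkish rules to the two uppercase I's only.
final class TurkishLowerCaseSketch {
    private static final char LATIN_CAPITAL_I = '\u0049';          // U+0049
    private static final char LATIN_CAPITAL_I_WITH_DOT = '\u0130'; // U+0130
    private static final char LATIN_SMALL_I = '\u0069';            // U+0069
    private static final char LATIN_SMALL_DOTLESS_I = '\u0131';    // U+0131

    /** Lowercases the first {@code length} chars of {@code buffer} in place. */
    static void lowerCase(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            switch (buffer[i]) {
                case LATIN_CAPITAL_I:
                    // Turkish: undotted capital I lowercases to dotless i.
                    buffer[i] = LATIN_SMALL_DOTLESS_I;
                    break;
                case LATIN_CAPITAL_I_WITH_DOT:
                    // Turkish: dotted capital I lowercases to plain i.
                    buffer[i] = LATIN_SMALL_I;
                    break;
                default:
                    // Everything else: the same call LowerCaseFilter relies on.
                    buffer[i] = Character.toLowerCase(buffer[i]);
            }
        }
    }

    /** Convenience wrapper for trying the rule on a String. */
    static String lowerCase(String term) {
        char[] buffer = term.toCharArray();
        lowerCase(buffer, buffer.length);
        return new String(buffer);
    }
}
```

The only place this differs from a plain Character.toLowerCase() loop on
undecomposed input is the capital I, which Character.toLowerCase() maps to
i rather than to the dotless form.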

Will you open an issue?


On Mon, Nov 30, 2009 at 2:00 PM, AHMET ARSLAN <iorixxx@yahoo.com> wrote:

> In the Turkish alphabet, the lowercase of I is not i; it is LATIN SMALL
> LETTER DOTLESS I. LowerCaseFilter, which uses Character.toLowerCase(), gets
> just that character wrong.
>
> http://java.sun.com/javase/6/docs/api/java/lang/String.html#toLowerCase()
>
> I am not sure whether it is worth adding a new TokenFilter for the Turkish
> language. I see that GreekLowerCaseFilter and RussianLowerCaseFilter exist.
> It would be nice to see a TurkishLowerCaseFilter in Lucene.
>
> The wiki recommends asking permission from Lucene committers before opening
> an issue. I can provide a patch (although it is just a one-line change in
> the original LowerCaseFilter) if you want.
>
> Thank you for your consideration.
>
> Ahmet


-- 
Robert Muir
rcmuir@gmail.com
