lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: LowerCaseFilter fails one letter (I) of Turkish alphabet
Date Mon, 30 Nov 2009 19:53:30 GMT
Robert, what if I need to do additional filtering after CollationKeyFilter,
such as stopword removal, abbreviation handling, stemming, etc.? Will that
still be possible if I use CollationKeyFilter?

I also noticed that CollationKeyFilter (CKF) creates a String out of the
char[]. If the code already does that, why not use
String.toLowerCase(Locale)?
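
To make the question concrete, here is a minimal sketch of what I have in mind. The class and the lower() helper are hypothetical names for illustration, not existing Lucene code; only String.toLowerCase(Locale) is the real JDK call. It copies the term buffer into a String and lowercases it with the locale's rules, so the Turkish locale maps 'I' (U+0049) to dotless 'ı' (U+0131) where a locale-neutral lowercase gives 'i'.

```java
import java.util.Locale;

public class TurkishLowercaseDemo {
    // Hypothetical helper: lowercase the first `length` chars of a term
    // buffer using the case rules of the given locale.
    static String lower(char[] buffer, int length, Locale locale) {
        return new String(buffer, 0, length).toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        char[] term = "ISIK".toCharArray();

        // Turkish rules: 'I' becomes dotless U+0131.
        String tr = lower(term, term.length, turkish);
        // Locale-neutral rules: 'I' becomes plain 'i'.
        String root = lower(term, term.length, Locale.ROOT);

        System.out.println(tr.equals("\u0131s\u0131k")); // true
        System.out.println(root.equals("isik"));         // true
    }
}
```

The obvious cost is one String allocation per token, which is why I am asking only for the case where the code already builds a String.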

Shai

On Mon, Nov 30, 2009 at 9:46 PM, Simon Willnauer <
simon.willnauer@googlemail.com> wrote:

> On Mon, Nov 30, 2009 at 8:08 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >> I am not sure if it is worth to add a new TokenFilter for Turkish
> >> language. I see there exist GreekLowerCaseFilter and
> >> RussianLowerCaseFilter. It would be nice to see TurkishLowerCaseFilter
> >> in Lucene.
> >>
> > just to clarify, GreekLowerCaseFilter really shouldn't exist either. The
> > final sigma problem it has (where there are two lowercase forms depending
> > upon position in word), this is also solved with unicode case folding or
> > collation. This is a perfect example of how lowercase is the wrong
> > operation for search.
> >
> > and RussianLowerCaseFilter is deprecated now, it does the exact same
> > thing as LowerCaseFilter.
> btw. we should fix supplementary chars in there too even if it is
> deprecated.
>
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
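
For reference, the final sigma problem described in the quoted text can be reproduced with a few lines of plain Java. This is only a sketch: the class and method names are made up for illustration, and the upper-then-lower step is a crude stand-in for real Unicode case folding, not what any Lucene filter does. It also works on single chars only, so it ignores supplementary characters.

```java
public class SigmaFoldDemo {
    // Per-character lowercasing, the way a simple lowercase filter works.
    static String lowerPerChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    // Crude approximation of case folding: uppercase, then lowercase, each
    // char. This collapses final sigma (U+03C2) and sigma (U+03C3).
    static String foldPerChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(Character.toUpperCase(buf[i]));
        }
        return new String(buf);
    }

    public static void main(String[] args) {
        String upper = "\u039F\u0394\u039F\u03A3";   // all capitals, ends in U+03A3
        String written = "\u03BF\u03B4\u03BF\u03C2"; // lowercase, ends in final sigma

        // Lowercasing alone leaves the two forms different.
        System.out.println(lowerPerChar(upper).equals(lowerPerChar(written))); // false
        // Folding both sides makes them match.
        System.out.println(foldPerChar(upper).equals(foldPerChar(written)));   // true
    }
}
```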
