lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: LowerCaseFilter fails one letter (I) of Turkish alphabet
Date Mon, 30 Nov 2009 20:41:02 GMT
Thanks Robert. In my Analyzer I do case folding according to Unicode tables.
So ß is converted to "SS". I do the same for diacritic removal and
Hiragana/Katakan folding. I then apply a LowerCaseFilter, which gets the
"SS" to "ss".

I checked the filter's output on "AĞACIN" and it's "AGACIN". If I
toLowerCase(new Locale("tr")), it's lowered to "agacın", which is correct.
Of course, LowerCaseFilter does not do that, I used String's.

I just realized I've included lots of folding tables, except for
http://unicode.org/Public/UNIDATA/CaseFolding.txt. I guess I counted on
LowerCaseFilter too much. Is that the table you're working w/ in
LUCENE-1488? I assume you use more of course :)

Shai

On Mon, Nov 30, 2009 at 10:00 PM, Robert Muir <rcmuir@gmail.com> wrote:

> Shai, again the problem is not really performance (I am ignoring that for
> now), but the fact that lowercasing and case folding are different.
>
> An easy example, the lowercase of ß is ß itself, it is already lowercase.
> it will not match with 'SS' if you use lowercase filter.
>
> if you use case folding, these two will match.
>
> On Mon, Nov 30, 2009 at 2:53 PM, Shai Erera <serera@gmail.com> wrote:
>
> > Robert, what if I need to do additional filtering after
> CollationKeyFilter,
> > like stopwords removal, abbreviations handling, stemming etc? Will that
> be
> > possible if I use CollationKeyFilter?
> >
> > I also noticed CKF creates a String out of the char[]. If the code
> already
> > does that, why not use String.toLowerCase(Locale)?
> >
> > Shai
> >
> > On Mon, Nov 30, 2009 at 9:46 PM, Simon Willnauer <
> > simon.willnauer@googlemail.com> wrote:
> >
> > > On Mon, Nov 30, 2009 at 8:08 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > > >> I am not sure if it is worth to add a new TokenFilter for Turkish
> > > language.
> > > >> I see there exist GreekLowerCaseFilter and RussianLowerCaseFilter.
> It
> > > would
> > > >> be nice to see TurkishLowerCaseFilter in Lucene.
> > > >>
> > > >>
> > > >>
> > > > just to clarify, GreekLowerCaseFilter really shouldn't exist either.
> > The
> > > > final sigma problem it has (where there are two lowercase forms
> > depending
> > > > upon position in word), this is also solved with unicode case folding
> > or
> > > > collation. This is a perfect example of how lowercase is the wrong
> > > operation
> > > > for search.
> > > >
> > > > and RussianLowerCaseFilter is deprecated now, it does the exact same
> > > thing
> > > > as LowerCaseFilter.
> > > btw. we should fix supplementary chars in there too even if it is
> > > deprecated.
> > >
> > > >
> > > > --
> > > > Robert Muir
> > > > rcmuir@gmail.com
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message