lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: LowerCaseFilter fails one letter (I) of Turkish alphabet
Date Mon, 30 Nov 2009 20:46:30 GMT
Shai, no: behind the scenes I am using just that table, via the ICU library.

The only reason the CaseFoldingFilter in my patch is more complex is that
I also apply the FC_NFKC_Closure mappings.
You can apply these tables in your implementation too if you are also
normalizing; they are here:
http://unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt

The reasoning is that if you are also normalizing to form NFKC or NFKD,
folding and normalizing once is not enough: you would have to do
NFKC(Fold(NFKC(Fold(x)))) or NFKD(Fold(NFKD(Fold(x)))).

With the closure mappings you can instead do just NFKC(Fold_w_closure(x))
or NFKD(Fold_w_closure(x)), and avoid the double normalization and folding,
for better performance.
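
To illustrate why a single fold-then-normalize pass is not enough, here is a minimal sketch using only the JDK. It is not the code from the patch: fold() is an assumed stand-in that approximates Unicode case folding with an upper-then-lower round trip, and the class and method names are hypothetical. U+2102 (DOUBLE-STRUCK CAPITAL C) has no case mapping of its own, but NFKC turns it into a plain "C", which then still needs folding.

```java
import java.text.Normalizer;
import java.util.Locale;

class FoldThenNormalize {
    // Assumption: approximates Unicode case folding with JDK case mappings.
    // This is NOT ICU's folding and does not apply FC_NFKC_Closure.
    static String fold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // NFKC(Fold(x)): one pass of folding followed by normalization.
    static String singlePass(String s) {
        return nfkc(fold(s));
    }

    // NFKC(Fold(NFKC(Fold(x)))): the double pass described above.
    static String doublePass(String s) {
        return nfkc(fold(nfkc(fold(s))));
    }

    public static void main(String[] args) {
        String x = "\u2102"; // DOUBLE-STRUCK CAPITAL C
        System.out.println(singlePass(x)); // normalization exposes an unfolded "C"
        System.out.println(doublePass(x)); // the second fold lowers it to "c"
    }
}
```

A fold that already includes the closure mappings would be expected to send such a character to its final form in one step, which is what makes the single NFKC(Fold_w_closure(x)) pass sufficient.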

On Mon, Nov 30, 2009 at 3:41 PM, Shai Erera <serera@gmail.com> wrote:

> Thanks Robert. In my Analyzer I do case folding according to Unicode
> tables.
> So ß is converted to "SS". I do the same for diacritic removal and
> Hiragana/Katakana folding. I then apply a LowerCaseFilter, which gets the
> "SS" to "ss".
>
> I checked the filter's output on "AĞACIN" and it's "AGACIN". If I
> toLowerCase(new Locale("tr")), it's lowered to "agacın", which is correct.
> Of course, LowerCaseFilter does not do that, I used String's.
>
> I just realized I've included lots of folding tables, except for
> http://unicode.org/Public/UNIDATA/CaseFolding.txt. I guess I counted on
> LowerCaseFilter too much. Is that the table you're working w/ in
> LUCENE-1488? I assume you use more of course :)
>
> Shai
>
> On Mon, Nov 30, 2009 at 10:00 PM, Robert Muir <rcmuir@gmail.com> wrote:
>
> > Shai, again the problem is not really performance (I am ignoring that for
> > now), but the fact that lowercasing and case folding are different.
> >
> > An easy example, the lowercase of ß is ß itself, it is already lowercase.
> > it will not match with 'SS' if you use lowercase filter.
> >
> > if you use case folding, these two will match.
> >
> > On Mon, Nov 30, 2009 at 2:53 PM, Shai Erera <serera@gmail.com> wrote:
> >
> > > Robert, what if I need to do additional filtering after CollationKeyFilter,
> > > like stopwords removal, abbreviations handling, stemming etc? Will that be
> > > possible if I use CollationKeyFilter?
> > >
> > > I also noticed CKF creates a String out of the char[]. If the code already
> > > does that, why not use String.toLowerCase(Locale)?
> > >
> > > Shai
> > >
> > > On Mon, Nov 30, 2009 at 9:46 PM, Simon Willnauer <
> > > simon.willnauer@googlemail.com> wrote:
> > >
> > > > On Mon, Nov 30, 2009 at 8:08 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > > > >> I am not sure if it is worth to add a new TokenFilter for Turkish language.
> > > > >> I see there exist GreekLowerCaseFilter and RussianLowerCaseFilter. It would
> > > > >> be nice to see TurkishLowerCaseFilter in Lucene.
> > > > >>
> > > > > just to clarify, GreekLowerCaseFilter really shouldn't exist either. The
> > > > > final sigma problem it has (where there are two lowercase forms depending
> > > > > upon position in word), this is also solved with unicode case folding or
> > > > > collation. This is a perfect example of how lowercase is the wrong operation
> > > > > for search.
> > > > >
> > > > > and RussianLowerCaseFilter is deprecated now, it does the exact same thing
> > > > > as LowerCaseFilter.
> > > > btw. we should fix supplementary chars in there too even if it is
> > > > deprecated.
> > > >
> > > > >
> > > > > --
> > > > > Robert Muir
> > > > > rcmuir@gmail.com
> > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
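
The two points in the quoted exchange (locale-sensitive lowercasing of the Turkish I, and ß not matching "SS" under plain lowercasing) can be checked with the JDK alone. This is only a sketch: the class and method names are hypothetical, and approxFold() is an assumed approximation of case folding via an upper-then-lower round trip, not Lucene's LowerCaseFilter or ICU's folding.

```java
import java.util.Locale;

class LowercaseVsFold {
    // Locale-insensitive lowercasing, comparable in spirit to a plain lowercase filter.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Turkish-aware lowercasing: capital I becomes dotless ı (U+0131).
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    // Assumption: rough stand-in for case folding, not the real Unicode table.
    static String approxFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(lowerRoot("AGACIN"));    // plain lowercase keeps a dotted i
        System.out.println(lowerTurkish("AGACIN")); // Turkish rules give dotless ı

        // Lowercasing alone does not make ß and "SS" comparable...
        System.out.println(lowerRoot("\u00DF").equals(lowerRoot("SS")));
        // ...while folding both sides does.
        System.out.println(approxFold("\u00DF").equals(approxFold("SS")));
    }
}
```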



-- 
Robert Muir
rcmuir@gmail.com
