lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: LowerCaseFilter fails one letter (I) of Turkish alphabet
Date Mon, 30 Nov 2009 21:07:09 GMT
Thanks again, I'll use this table as well. What I do is read those tables
and store them in a char[], for fast lookups of folding chars. I noticed your
comments in the code about not doing so, because the tables would then need
to be updated once in a while, and I agree. But ICU's lack of a char[] API
drove me away from it. I've had bad experiences with it, performance-wise, in
the past.
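
A minimal sketch of the char[] lookup approach described above, assuming only one-to-one mappings inside the Basic Multilingual Plane are stored. The class `CharFoldTable` and its methods are hypothetical names for illustration; they are not part of Lucene, ICU, or the LUCENE-1488 patch. The line format parsed by `putLine` follows CaseFolding.txt (`code; status; mapping; # name`).

```java
/** Hypothetical helper: a flat char[] table for one-to-one folding lookups. */
public final class CharFoldTable {
    // One slot per UTF-16 code unit; every slot starts out mapping to itself.
    private final char[] table = new char[Character.MAX_VALUE + 1];

    public CharFoldTable() {
        for (int i = 0; i < table.length; i++) {
            table[i] = (char) i;
        }
    }

    /**
     * Parses one line of CaseFolding.txt and stores it if it is a simple
     * mapping (status C or S) that fits in a single char.
     * Returns true if a mapping was stored, false if the line was skipped.
     */
    public boolean putLine(String line) {
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        if (data.isEmpty()) {
            return false; // comment or blank line
        }
        String[] fields = data.split(";");
        if (fields.length < 3) {
            return false;
        }
        String status = fields[1].trim();
        if (!status.equals("C") && !status.equals("S")) {
            return false; // F (full) and T (Turkic) mappings are not handled here
        }
        int from = Integer.parseInt(fields[0].trim(), 16);
        int to = Integer.parseInt(fields[2].trim(), 16);
        if (from > Character.MAX_VALUE || to > Character.MAX_VALUE) {
            return false; // supplementary code points do not fit in a char slot
        }
        table[from] = (char) to;
        return true;
    }

    /** Folds buffer[0..length) in place with one array lookup per char. */
    public void fold(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            buffer[i] = table[buffer[i]];
        }
    }
}
```

The table costs 128 KB and never allocates during lookup, which is the appeal. The trade-off is that full foldings that expand to several chars (such as ß to "ss") and supplementary characters need a separate path, and the table has to be regenerated when the Unicode data changes.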

I even compared Java's Collator to ICU's, and Java's seemed to perform
faster to me, although that wasn't a real performance test. But ICU seems to
be more accurate than Java's (which is annoying). I figured that I could apply
some rules on my own, but the more I read about contrib/analyzers,
contrib/collation, LUCENE-1488 and this thread, the more I'm beginning to
understand that "on my own" means staying alert to a lot of stuff I'm not
alert to today :).

Two comments about the patch in LUCENE-1488. In some places you use
StringBuffer and in others StringBuilder. Is that intentional? If not, I think
you should move to StringBuilder. Also, in ICUCaseFoldingFilter, I believe
termAtt can be declared final?

Thanks,
Shai

On Mon, Nov 30, 2009 at 10:46 PM, Robert Muir <rcmuir@gmail.com> wrote:

> Shai, no, behind the scenes I am using just that table, via ICU library.
>
> The only reason the CaseFoldingFilter in my patch is more complex, is
> because I also apply FC_NFKC_Closure mappings.
> You can apply these tables in your impl too if you are also using
> normalization, they are here:
> http://unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
>
> The reasoning for this, is that if you are also normalizing to form NFKC or
> NFKD, you would have to do NFKC(Fold(NFKC(Fold(x)))) or
> NFKD(Fold(NFKD(Fold(x)))).
>
> with the mappings instead you can just do NFKC(Fold_w_closure(x)) and
> NFKD(Fold_w_closure(x)), and avoid double normalization and folding for
> better performance.
>
> On Mon, Nov 30, 2009 at 3:41 PM, Shai Erera <serera@gmail.com> wrote:
>
> > Thanks Robert. In my Analyzer I do case folding according to Unicode
> > tables.
> > So ß is converted to "SS". I do the same for diacritic removal and
> > Hiragana/Katakana folding. I then apply a LowerCaseFilter, which gets the
> > "SS" to "ss".
> >
> > I checked the filter's output on "AĞACIN" and it's "AGACIN". If I
> > toLowerCase(new Locale("tr")), it's lowered to "agacın", which is correct.
> > Of course, LowerCaseFilter does not do that, I used String's.
> >
> > I just realized I've included lots of folding tables, except for
> > http://unicode.org/Public/UNIDATA/CaseFolding.txt. I guess I counted on
> > LowerCaseFilter too much. Is that the table you're working w/ in
> > LUCENE-1488? I assume you use more of course :)
> >
> > Shai
> >
> > On Mon, Nov 30, 2009 at 10:00 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >
> > > Shai, again the problem is not really performance (I am ignoring that for
> > > now), but the fact that lowercasing and case folding are different.
> > >
> > > An easy example, the lowercase of ß is ß itself, it is already lowercase.
> > > it will not match with 'SS' if you use lowercase filter.
> > >
> > > if you use case folding, these two will match.
> > >
> > > On Mon, Nov 30, 2009 at 2:53 PM, Shai Erera <serera@gmail.com> wrote:
> > >
> > > > Robert, what if I need to do additional filtering after CollationKeyFilter,
> > > > like stopwords removal, abbreviations handling, stemming etc? Will that be
> > > > possible if I use CollationKeyFilter?
> > > >
> > > > I also noticed CKF creates a String out of the char[]. If the code already
> > > > does that, why not use String.toLowerCase(Locale)?
> > > >
> > > > Shai
> > > >
> > > > On Mon, Nov 30, 2009 at 9:46 PM, Simon Willnauer <simon.willnauer@googlemail.com> wrote:
> > > >
> > > > > On Mon, Nov 30, 2009 at 8:08 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > > > > >> I am not sure if it is worth to add a new TokenFilter for Turkish language.
> > > > > >> I see there exist GreekLowerCaseFilter and RussianLowerCaseFilter. It would
> > > > > >> be nice to see TurkishLowerCaseFilter in Lucene.
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > > just to clarify, GreekLowerCaseFilter really shouldn't exist either. The
> > > > > > final sigma problem it has (where there are two lowercase forms depending
> > > > > > upon position in word), this is also solved with unicode case folding or
> > > > > > collation. This is a perfect example of how lowercase is the wrong operation
> > > > > > for search.
> > > > > >
> > > > > > and RussianLowerCaseFilter is deprecated now, it does the exact same thing
> > > > > > as LowerCaseFilter.
> > > > > btw. we should fix supplementary chars in there too even if it is
> > > > > deprecated.
> > > > >
> > > > > >
> > > > > > --
> > > > > > Robert Muir
> > > > > > rcmuir@gmail.com
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
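
The difference between lowercasing, locale-aware lowercasing, and case folding discussed in the quoted messages can be sketched with the JDK alone. This is an illustration, not the code from LUCENE-1488: `CaseDemo`, `approxFold`, and `nfkcFoldTwice` are hypothetical names, and `approxFold` (upper-case, then lower-case) is only a rough stand-in for real Unicode case folding from CaseFolding.txt. `Locale.forLanguageTag("tr")` is used in place of the thread's `new Locale("tr")`; both select Turkish.

```java
import java.text.Normalizer;
import java.util.Locale;

public final class CaseDemo {
    private CaseDemo() {}

    /** JDK-only approximation of case folding; an assumption, not CaseFolding.txt. */
    public static String approxFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    /** The double pass NFKC(Fold(NFKC(Fold(x)))) from the thread, built on approxFold. */
    public static String nfkcFoldTwice(String s) {
        String once = Normalizer.normalize(approxFold(s), Normalizer.Form.NFKC);
        return Normalizer.normalize(approxFold(once), Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String sharpS = "\u00DF"; // ß

        // Plain lowercasing leaves ß unchanged, so it never meets "SS".
        System.out.println(
            sharpS.toLowerCase(Locale.ROOT).equals("SS".toLowerCase(Locale.ROOT))); // false

        // Folding-style processing maps both to "ss", so they match.
        System.out.println(approxFold(sharpS).equals(approxFold("SS"))); // true

        // Turkish lowercasing maps I to dotless ı (U+0131); the root locale maps it to i.
        String word = "A\u011EACIN"; // AĞACIN
        System.out.println(word.toLowerCase(Locale.forLanguageTag("tr")));
        System.out.println(word.toLowerCase(Locale.ROOT));
    }
}
```

The two normalization and folding passes in `nfkcFoldTwice` are what the FC_NFKC_Closure mappings are meant to avoid: with the closure applied during folding, a single NFKC pass is enough.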
