lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: LowerCaseFilter fails one letter (I) of Turkish alphabet
Date Mon, 30 Nov 2009 21:14:42 GMT
On Mon, Nov 30, 2009 at 4:07 PM, Shai Erera <serera@gmail.com> wrote:

> Thanks again, I'll use this table as well.


you should only use it if you are normalizing to NFKC or NFKD afterwards...
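
to make the "only if you normalize afterwards" point concrete, here is a rough JDK-only sketch. The `fold` helper is an assumption for illustration: it approximates case folding with toUpperCase/toLowerCase under Locale.ROOT, which is neither real Unicode case folding nor what the patch does. U+2102 (DOUBLE-STRUCK CAPITAL C) has no case mapping of its own, but NFKC turns it into a plain 'C', so a single fold-then-normalize pass can leave an uppercase letter behind. That is the gap the double pass (or the closure mappings) covers.

```java
import java.text.Normalizer;
import java.util.Locale;

class FoldThenNormalizeSketch {
    // ASSUMPTION: crude stand-in for Unicode case folding, for illustration only.
    static String fold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // NFKC(Fold(x)): one pass, without any closure mappings.
    static String singlePass(String s) {
        return nfkc(fold(s));
    }

    // NFKC(Fold(NFKC(Fold(x)))): the double pass described in the quoted mail below.
    static String doublePass(String s) {
        return nfkc(fold(nfkc(fold(s))));
    }

    public static void main(String[] args) {
        String x = "\u2102"; // DOUBLE-STRUCK CAPITAL C
        System.out.println(singlePass(x)); // prints C
        System.out.println(doublePass(x)); // prints c
    }
}
```

a fold table that already includes the closure mappings would send U+2102 straight to 'c', which is why one pass is then enough.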


> What I do is read those tables
> and store in a char[], for fast lookups of folding chars. I noticed your
> comments in the code about not doing so because then the tables would need
> to be updated once in a while, and I agree. But ICU's lack of char[] API
> drove me away from it. I've had bad experience, performance-wise, with it
> in the past.
>

the case folding here is "mostly" char[]. It will use StringBuffer in some
rarer cases (where a single codepoint folds to multiple codepoints, such as
sharp s -> ss), but in most cases does not make use of it.
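
a minimal sketch of that "mostly char[]" shape, not the code from the patch. The one-entry expansion table and the use of Character.toLowerCase as the per-char mapping are assumptions for illustration, and supplementary characters are ignored here:

```java
import java.util.Arrays;

class CharArrayFoldSketch {
    // ASSUMPTION: a one-entry expansion table standing in for real full-folding data.
    static String expansion(char c) {
        return c == '\u00DF' ? "ss" : null; // sharp s -> ss
    }

    // Common case: map each char in a copied char[]; no builder is allocated.
    static char[] fold(char[] in, int len) {
        char[] out = Arrays.copyOf(in, len);
        for (int i = 0; i < len; i++) {
            if (expansion(out[i]) != null) {
                // Rare case: one char grows into several, so the length changes.
                return foldWithExpansion(in, len);
            }
            // Stand-in for a simple 1:1 fold (BMP chars only).
            out[i] = Character.toLowerCase(out[i]);
        }
        return out;
    }

    // Slow path, only reached when at least one char expands.
    static char[] foldWithExpansion(char[] in, int len) {
        StringBuilder sb = new StringBuilder(len + 4);
        for (int i = 0; i < len; i++) {
            String exp = expansion(in[i]);
            if (exp != null) {
                sb.append(exp);
            } else {
                sb.append(Character.toLowerCase(in[i]));
            }
        }
        return sb.toString().toCharArray();
    }
}
```

most terms never touch the builder, so the extra allocation is only paid for the few that expand.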

I agree with your statement that it's difficult to use char[] with ICU (which
is why I am using the low-level UCaseProps).

on the other hand, the JDK is worse. the normalization filter here also uses
char[] and implements quick check, neither of which you get from the java 6
api: the normalizer takes a CharSequence and returns a String (no char[]),
and there is no quick check, only a "possibly slow check" (in the case of
MAYBE you essentially have to normalize to find out whether the text is
normalized, so why not just normalize once).
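
for comparison, a small sketch of what normalizing a char[] term looks like with only the JDK. The class and method names are made up for illustration; the java.text.Normalizer calls are the standard ones:

```java
import java.text.Normalizer;

class JdkNormalizeSketch {
    // java.text.Normalizer has no char[] entry point, so the term buffer
    // is copied into a String and the result is copied back out again.
    static char[] nfkc(char[] buf, int len) {
        String in = new String(buf, 0, len);
        String out = Normalizer.normalize(in, Normalizer.Form.NFKC);
        return out.toCharArray();
    }

    // The only check the JDK offers returns a plain boolean; there is no
    // YES/NO/MAYBE result a filter could use to skip work cheaply.
    static boolean alreadyNfkc(char[] buf, int len) {
        return Normalizer.isNormalized(new String(buf, 0, len), Normalizer.Form.NFKC);
    }
}
```

calling alreadyNfkc and then nfkc on the same term builds the String twice, which is one more reason to just normalize once.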


> I even compared Java's Collator to ICU's, and Java's seemed to perform
> faster to me, although that wasn't a real performance test. But ICU seems
> to be more accurate than Java's (which is annoying). I figured that I can
> apply some rules on my own, but the more I read about contrib/analyzers,
> contrib/collation, LUCENE-1488 and this thread, I think I'm beginning to
> understand that "on my own" means staying alert to a lot of stuff I'm not
> today :).
>
> Two comments about the patch in LUCENE-1488. In some places you use
> StringBuffer and others StringBuilder. Is that intentional? If not, I think
> you should move to StringBuilder.


it's a StringBuffer because that's how it's specified in icu's UCaseProps. but
see above, it's not used except for the single -> multi codepoint foldings
(in which case the int return value does not carry the result but instead
tells you 'go look in the stringbuffer').


> Also, in ICUCaseFoldingFilter, I believe
> termAtt can be declared final?
>

yeah, probably some other things too, thanks :)

>
> Thanks,
> Shai
>
> On Mon, Nov 30, 2009 at 10:46 PM, Robert Muir <rcmuir@gmail.com> wrote:
>
> > Shai, no, behind the scenes I am using just that table, via ICU library.
> >
> > The only reason the CaseFoldingFilter in my patch is more complex, is
> > because I also apply FC_NFKC_Closure mappings.
> > You can apply these tables in your impl too if you are also using
> > normalization, they are here:
> > http://unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
> >
> > The reasoning for this is that if you are also normalizing to form NFKC or
> > NFKD, you would have to do NFKC(Fold(NFKC(Fold(x)))) or
> > NFKD(Fold(NFKD(Fold(x)))).
> >
> > with the mappings instead you can just do NFKC(Fold_w_closure(x)) and
> > NFKD(Fold_w_closure(x)), and avoid double normalization and folding for
> > better performance.
> >
> > On Mon, Nov 30, 2009 at 3:41 PM, Shai Erera <serera@gmail.com> wrote:
> >
> > > Thanks Robert. In my Analyzer I do case folding according to Unicode
> > > tables. So ß is converted to "SS". I do the same for diacritic removal
> > > and Hiragana/Katakana folding. I then apply a LowerCaseFilter, which
> > > gets the "SS" to "ss".
> > >
> > > I checked the filter's output on "AĞACIN" and it's "AGACIN". If I
> > > toLowerCase(new Locale("tr")), it's lowered to "agacın", which is
> > > correct. Of course, LowerCaseFilter does not do that, I used String's.
> > >
> > > I just realized I've included lots of folding tables, except for
> > > http://unicode.org/Public/UNIDATA/CaseFolding.txt. I guess I counted on
> > > LowerCaseFilter too much. Is that the table you're working w/ in
> > > LUCENE-1488? I assume you use more of course :)
> > >
> > > Shai
> > >
> > > On Mon, Nov 30, 2009 at 10:00 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > >
> > > > Shai, again the problem is not really performance (I am ignoring that
> > > > for now), but the fact that lowercasing and case folding are different.
> > > >
> > > > An easy example, the lowercase of ß is ß itself, it is already
> > > > lowercase. it will not match with 'SS' if you use lowercase filter.
> > > >
> > > > if you use case folding, these two will match.
> > > >
> > > > On Mon, Nov 30, 2009 at 2:53 PM, Shai Erera <serera@gmail.com> wrote:
> > > >
> > > > > Robert, what if I need to do additional filtering after
> > > > > CollationKeyFilter, like stopwords removal, abbreviations handling,
> > > > > stemming etc? Will that be possible if I use CollationKeyFilter?
> > > > >
> > > > > I also noticed CKF creates a String out of the char[]. If the code
> > > > > already does that, why not use String.toLowerCase(Locale)?
> > > > >
> > > > > Shai
> > > > >
> > > > > > On Mon, Nov 30, 2009 at 9:46 PM, Simon Willnauer <simon.willnauer@googlemail.com> wrote:
> > > > > >
> > > > > > > On Mon, Nov 30, 2009 at 8:08 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > > > > > > >> I am not sure if it is worth to add a new TokenFilter for
> > > > > > > >> Turkish language. I see there exist GreekLowerCaseFilter and
> > > > > > > >> RussianLowerCaseFilter. It would be nice to see
> > > > > > > >> TurkishLowerCaseFilter in Lucene.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > > just to clarify, GreekLowerCaseFilter really shouldn't exist
> > > > > > > > either. The final sigma problem it has (where there are two
> > > > > > > > lowercase forms depending upon position in word), this is also
> > > > > > > > solved with unicode case folding or collation. This is a
> > > > > > > > perfect example of how lowercase is the wrong operation for
> > > > > > > > search.
> > > > > > > >
> > > > > > > > and RussianLowerCaseFilter is deprecated now, it does the exact
> > > > > > > > same thing as LowerCaseFilter.
> > > > > > > btw. we should fix supplementary chars in there too even if it is
> > > > > > > deprecated.
> > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Robert Muir
> > > > > > > rcmuir@gmail.com
> > > > > > >
> > > > > >
> > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Robert Muir
> > > > rcmuir@gmail.com
> > > >
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>



-- 
Robert Muir
rcmuir@gmail.com
