lucy-dev mailing list archives

From Marvin Humphrey
Subject [lucy-dev] Re: [lucy-commits] svn commit: r1210630 - in /incubator/lucy/branches/LUCY-196-uax-tokenizer: core/Lucy/Analysis/PolyAnalyzer.c core/Lucy/Analysis/PolyAnalyzer.cfh perl/lib/Lucy/Analysis/
Date Mon, 05 Dec 2011 21:38:14 GMT
Hi, Nick,

Awesome stuff coming through on the new Lucy::Analysis::StandardTokenizer!

On Mon, Dec 05, 2011 at 09:02:42PM -0000, wrote:
>  PolyAnalyzer*
>  PolyAnalyzer_new(const CharBuf *language, VArray *analyzers) {
> @@ -43,7 +43,7 @@ PolyAnalyzer_init(PolyAnalyzer *self, co
>      else if (language) {
>          self->analyzers = VA_new(3);
>          VA_Push(self->analyzers, (Obj*)CaseFolder_new());
> -        VA_Push(self->analyzers, (Obj*)RegexTokenizer_new(NULL));
> +        VA_Push(self->analyzers, (Obj*)StandardTokenizer_new());
>          VA_Push(self->analyzers, (Obj*)SnowStemmer_new(language));
>      }

This will cause a backwards compatibility break.  I really want to make your
StandardTokenizer the default, but I think we might want to go about it
differently.

How about we leave PolyAnalyzer alone, but add a new class called
"EasyAnalyzer", with the following default stack:

    1. StandardTokenizer
    2. Normalizer
    3. SnowballStemmer

This integrates both of your recent contributions and changes the order to
avoid the Highlighter problems you identified, which also brings it more in
line with the potential refactoring you talked about.
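
To make the proposed ordering concrete, here is a minimal, self-contained
sketch in C.  It does not use Lucy's API: TokenBatch, toy_tokenize,
toy_normalize, toy_stem and toy_easy_analyze are all hypothetical stand-ins,
and the "stemmer" just strips a trailing "s".  The only thing it is meant to
show is the stage order (tokenize first, then normalize, then stem), so that
the tokenizer sees the original text before any case folding happens.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS    16
#define MAX_TOKEN_LEN 32

/* Hypothetical stand-in for a batch of tokens; not Lucy's Inversion. */
typedef struct {
    char   text[MAX_TOKENS][MAX_TOKEN_LEN];
    size_t count;
} TokenBatch;

/* Stage 1 (stand-in for StandardTokenizer): split on non-alphanumerics.
 * Overlong tokens are truncated and tokens past MAX_TOKENS are dropped. */
static void
toy_tokenize(const char *input, TokenBatch *batch) {
    size_t len = 0;
    batch->count = 0;
    for (const char *p = input; ; p++) {
        if (*p != '\0' && isalnum((unsigned char)*p)) {
            if (len < MAX_TOKEN_LEN - 1 && batch->count < MAX_TOKENS) {
                batch->text[batch->count][len++] = *p;
            }
        }
        else {
            if (len > 0) {
                /* End of a token: terminate it and start the next one. */
                batch->text[batch->count][len] = '\0';
                batch->count++;
                len = 0;
            }
            if (*p == '\0') { break; }
        }
    }
}

/* Stage 2 (stand-in for Normalizer): ASCII case folding only. */
static void
toy_normalize(TokenBatch *batch) {
    for (size_t i = 0; i < batch->count; i++) {
        for (char *c = batch->text[i]; *c != '\0'; c++) {
            *c = (char)tolower((unsigned char)*c);
        }
    }
}

/* Stage 3 (stand-in for SnowballStemmer): strip one trailing 's'
 * from tokens longer than three characters. */
static void
toy_stem(TokenBatch *batch) {
    for (size_t i = 0; i < batch->count; i++) {
        size_t len = strlen(batch->text[i]);
        if (len > 3 && batch->text[i][len - 1] == 's') {
            batch->text[i][len - 1] = '\0';
        }
    }
}

/* The proposed stack order: tokenizer, then normalizer, then stemmer. */
static void
toy_easy_analyze(const char *input, TokenBatch *batch) {
    assert(input != NULL && batch != NULL);
    toy_tokenize(input, batch);
    toy_normalize(batch);
    toy_stem(batch);
}
```

Calling toy_easy_analyze("Running DOGS bark", &batch) leaves three tokens in
the batch, each lowercased and run through the toy stemmer.  The real classes
do far more work per stage, so this says nothing about the performance
question raised below.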

It would be nice to benchmark this just to see what sort of performance impact
changing the order has before we finalize it.

If this works out, we can then swap out PolyAnalyzer for EasyAnalyzer
throughout the tutorial and other high-level documentation.

Marvin Humphrey
