lucy-dev mailing list archives

From: Nick Wellnhofer <wellnho...@aevum.de>
Subject: Re: [lucy-dev] Re: [lucy-commits] svn commit: r1210630 - in /incubator/lucy/branches/LUCY-196-uax-tokenizer: core/Lucy/Analysis/PolyAnalyzer.c core/Lucy/Analysis/PolyAnalyzer.cfh perl/lib/Lucy/Analysis/PolyAnalyzer.pm
Date: Tue, 06 Dec 2011 12:51:05 GMT

On 05/12/2011 22:38, Marvin Humphrey wrote:
> Hi, Nick,
>
> Awesome stuff coming through on the new Lucy::Analysis::StandardTokenizer!
>
> On Mon, Dec 05, 2011 at 09:02:42PM -0000, nwellnhof@apache.org wrote:
>>   PolyAnalyzer*
>>   PolyAnalyzer_new(const CharBuf *language, VArray *analyzers) {
>> @@ -43,7 +43,7 @@ PolyAnalyzer_init(PolyAnalyzer *self, co
>>       else if (language) {
>>           self->analyzers = VA_new(3);
>>           VA_Push(self->analyzers, (Obj*)CaseFolder_new());
>> -        VA_Push(self->analyzers, (Obj*)RegexTokenizer_new(NULL));
>> +        VA_Push(self->analyzers, (Obj*)StandardTokenizer_new());
>>           VA_Push(self->analyzers, (Obj*)SnowStemmer_new(language));
>>       }
>
> This will cause a backwards compatibility break.  I really want to make your
> StandardTokenizer the default, but I think we might want to go about it
> differently.

I made that change mainly to see whether the test suite would break (it
didn't). I plan to revert it before committing StandardTokenizer to trunk.

> How about we leave PolyAnalyzer alone, but add a new class called
> "EasyAnalyzer", with the following default stack:
>
>      1. StandardTokenizer
>      2. Normalizer
>      3. SnowballStemmer
>
> This integrates both your recent contributions, plus changes the order to
> avoid the Highlighter problems you identified and be more in line with the
> potential refactoring you talked about.
>
> It would be nice to benchmark this just to see what sort of performance impact
> changing the order has before we finalize it.
>
> If this works out, we can then swap out PolyAnalyzer for EasyAnalyzer
> throughout the tutorial and other high-level documentation.

Sounds like a good idea.

Nick
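
For readers following the thread, the stack order proposed above (tokenize first, then normalize, then stem) can be illustrated with a minimal, self-contained sketch. This is not Lucy code and does not use Lucy's API: every name here (Token, TokenList, toy_tokenize, toy_normalize, toy_stem, toy_easy_analyze) is hypothetical, the tokenizer is a plain ASCII splitter rather than a UAX #29 implementation, and the "stemmer" only strips a trailing "s". It only shows that when tokenization runs first, each token keeps the offsets it had in the original, unmodified text while later stages rewrite the token text.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS    32
#define MAX_TOKEN_LEN 32

/* Toy token: text plus the offsets it had in the original input. */
typedef struct {
    char   text[MAX_TOKEN_LEN];
    size_t start;
    size_t end;
} Token;

typedef struct {
    Token  tokens[MAX_TOKENS];
    size_t count;
} TokenList;

/* Stages after the tokenizer rewrite token text in place. */
typedef void (*TokenFilter)(TokenList *list);

/* Stage 1 (hypothetical stand-in for a tokenizer): split on
 * non-alphanumeric ASCII bytes and record the original offsets. */
static void
toy_tokenize(const char *input, TokenList *list) {
    size_t i   = 0;
    size_t len = strlen(input);
    size_t start, n;
    list->count = 0;
    while (i < len && list->count < MAX_TOKENS) {
        while (i < len && !isalnum((unsigned char)input[i])) { i++; }
        start = i;
        while (i < len && isalnum((unsigned char)input[i])) { i++; }
        if (i > start) {
            Token *tok = &list->tokens[list->count++];
            n = i - start;
            if (n >= MAX_TOKEN_LEN) { n = MAX_TOKEN_LEN - 1; } /* truncate */
            memcpy(tok->text, input + start, n);
            tok->text[n] = '\0';
            tok->start = start;
            tok->end   = i;
        }
    }
}

/* Stage 2 (hypothetical stand-in for a normalizer): lowercase ASCII. */
static void
toy_normalize(TokenList *list) {
    size_t i, j;
    for (i = 0; i < list->count; i++) {
        for (j = 0; list->tokens[i].text[j] != '\0'; j++) {
            list->tokens[i].text[j] =
                (char)tolower((unsigned char)list->tokens[i].text[j]);
        }
    }
}

/* Stage 3 (toy stand-in for a stemmer): drop one trailing 's' from
 * tokens longer than three characters. Not a real stemming algorithm. */
static void
toy_stem(TokenList *list) {
    size_t i, n;
    for (i = 0; i < list->count; i++) {
        n = strlen(list->tokens[i].text);
        if (n > 3 && list->tokens[i].text[n - 1] == 's') {
            list->tokens[i].text[n - 1] = '\0';
        }
    }
}

/* Run the stack in the proposed order: tokenize, normalize, stem. */
void
toy_easy_analyze(const char *input, TokenList *list) {
    static const TokenFilter filters[] = { toy_normalize, toy_stem };
    size_t i;
    assert(input != NULL && list != NULL);
    toy_tokenize(input, list);
    for (i = 0; i < sizeof(filters) / sizeof(filters[0]); i++) {
        filters[i](list);
    }
}
```

Calling toy_easy_analyze("Happy Dogs, running", &list) on a caller-declared TokenList yields three tokens; the second has text "dog" but still carries start 6 and end 10, the span of "Dogs" in the original string.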
