incubator-lucy-dev mailing list archives

From: Nick Wellnhofer <wellnho...@aevum.de>
Subject: Re: [lucy-dev] Extending the StandardTokenizer
Date: Mon, 20 Feb 2012 19:07:00 GMT
On 20/02/2012 18:27, Marvin Humphrey wrote:
> On Mon, Feb 20, 2012 at 01:52:26PM +0100, Nick Wellnhofer wrote:
>> Currently, the new StandardTokenizer implements the word break algorithm
>> as defined in Unicode Annex #29. One detail of this algorithm is that it
>> defines a set of "MidLetter" and "MidNum" characters which don't break a
>> sequence of letters or numbers. It seems the main reason is to not break
>> around characters like apostrophes or number separators.
>>
>> While some people might prefer this behavior, I'd like to add a second
>> mode of operation that does split on all characters that are not
>> alphanumeric with the exception of underscores. This would very much
>> resemble a RegexTokenizer with a \w+ pattern.
>
> The documentation for the Lucene StandardTokenizer contains this paragraph:
>
>      Many applications have specific tokenizer needs. If this tokenizer does
>      not suit your application, please consider copying this source code
>      directory to your project and maintaining your own grammar-based
>      tokenizer.
>
> There is a lot of accumulated wisdom in that passage that I think we ought to
> consider.
>
>> The whole thing could be implemented by simply adding an option to
>> StandardTokenizer so that "MidLetter" and "MidNum" characters are ignored.
>
> I'm concerned that this may be the first feature request of many to come
> for StandardTokenizer, and that attempting to support all such requests within
> core is not sustainable.

I understand your concern, but I think the extension I proposed is the
most useful and obvious one. I'm biased, of course, and other people will
have different needs.
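
To make the difference concrete, here is what the two behaviors look like
in plain Perl (the second part is just an ordinary \w+ match, not Lucy
code):

    # UAX#29 keeps MidLetter/MidNum(Let) characters inside tokens:
    #   "don't stop at 3.14"  =>  ("don't", "stop", "at", "3.14")
    # The proposed mode would behave like a plain \w+ match instead:
    my @tokens = "don't stop at 3.14" =~ /\w+/g;
    #   =>  ("don", "t", "stop", "at", "3", "14")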

> To address the immediate concern, is it an option to just use RegexTokenizer
> with a \w+ pattern?  RegexTokenizer's primary utility is that it solves many,
> many use cases while posing a minimal ongoing maintenance burden.

A plain \w+ pattern would work for me. I'm mainly interested in the 
performance benefits of StandardTokenizer.
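
Something like the following should do it, if I read the RegexTokenizer
docs right (untested):

    use Lucy::Analysis::RegexTokenizer;

    # Split on everything that isn't alphanumeric or underscore.
    my $tokenizer = Lucy::Analysis::RegexTokenizer->new( pattern => '\w+' );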

Actually, you can formulate the complete UAX#29 word-breaking rules as a
Perl regex that is even quite readable. But performance would probably
suffer even more because you'd have to use Perl's \p{} construct to look
up word break properties.
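
A simplified sketch of what I mean, covering only the letter rules
(WB5-WB7) and nowhere near the complete rule set:

    # Match a run of letters, allowing single MidLetter/MidNumLet
    # characters between letters, via Perl's word break property classes.
    my $letter = qr/\p{WB=ALetter}/;
    my $mid    = qr/[\p{WB=MidLetter}\p{WB=MidNumLet}]/;
    my $word   = qr/ $letter+ (?: $mid $letter+ )* /x;

    my @tokens = "can't stop" =~ /$word/g;   # => ("can't", "stop")

Every \p{} class above implies a property lookup per character, which is
where the performance would go.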

> Thinking longer term, I believe the interests of all would be best served if
> the progression went something like this:
>
>     1. Start with StandardTokenizer as the default.
>     2. Change to RegexTokenizer (or others) to address logical requirements
>        that StandardTokenizer does not meet.
>     3. Compile your own Tokenizer when you need to max out performance.
>
> In other words, our focus should be on making it possible to "Extend
> StandardTokenizer" (and potentially other Analyzers) arbitrarily.

One solution I've been thinking about is to make StandardTokenizer work
with arbitrary word break property tables. That is, use the rules
described in UAX#29 but allow for customized mappings of the word break
property, which should cover many use cases. This would basically mean
porting the code in devel/bin/UnicodeTable.pm to C and providing a nice
public interface. It's certainly feasible, but there are some challenges
involved, serialization for example.
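
From the Perl side, the interface might look something like this. To be
clear, this is purely hypothetical; the option name and the override
mechanism are invented for illustration:

    # Hypothetical: remap the word break property of individual code
    # points before the table is built. No such option exists today.
    my $tokenizer = Lucy::Analysis::StandardTokenizer->new(
        word_break_overrides => {
            "'" => 'Other',    # stop treating apostrophe as MidNumLet
            '.' => 'Other',    # stop treating full stop as MidNumLet
        },
    );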

> If that goal seems too far away, then my next suggestion would be to create a
> LucyX class to house a StandardTokenizer embellished with arbitrary extensions
> -- working name: LucyX::Analysis::NonStandardTokenizer.

That would be OK with me. On another note, is it possible to package 
Lucy extensions that contain C code outside of the main source tree?

Nick
