incubator-lucy-dev mailing list archives

From: Marvin Humphrey <mar...@rectangular.com>
Subject: Re: [lucy-dev] Extending the StandardTokenizer
Date: Mon, 20 Feb 2012 17:27:37 GMT
On Mon, Feb 20, 2012 at 01:52:26PM +0100, Nick Wellnhofer wrote:
> Currently, the new StandardTokenizer implements the word break algorithm  
> as defined in Unicode Annex #29. One detail of this algorithm is that it  
> defines a set of "MidLetter" and "MidNum" characters which don't break a  
> sequence of letters or numbers. It seems the main reason is to not break  
> around characters like apostrophes or number separators.
>
> While some people might prefer this behavior, I'd like to add a second
> mode of operation that splits on all characters that are not
> alphanumeric, with the exception of underscores. This would very much
> resemble a RegexTokenizer with a \w+ pattern.

The documentation for the Lucene StandardTokenizer contains this paragraph:

    Many applications have specific tokenizer needs. If this tokenizer does
    not suit your application, please consider copying this source code
    directory to your project and maintaining your own grammar-based
    tokenizer. 

There is a lot of accumulated wisdom in that passage that I think we ought to
consider.

> The whole thing could be implemented by simply adding an option to
> StandardTokenizer so that "MidLetter" and "MidNum" characters are ignored.

I'm concerned that this may be the first feature request of many to come
for StandardTokenizer, and that attempting to support all such requests
within core is not sustainable.

To address the immediate concern, is it an option to just use RegexTokenizer
with a \w+ pattern?  RegexTokenizer's primary utility is that it solves many,
many use cases while posing a minimal ongoing maintenance burden.

Thinking longer term, I believe the interests of all would be best served if
the progression went something like this:

   1. Start with StandardTokenizer as the default.
   2. Change to RegexTokenizer (or others) to address logical requirements
      that StandardTokenizer does not meet (see the sketch below).
   3. Compile your own Tokenizer when you need to max out performance.
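
To make steps 1 and 2 concrete, here is a rough sketch of an analyzer
chain that starts with StandardTokenizer and later swaps in
RegexTokenizer while leaving the rest of the chain alone (the \w+ pattern
is just an example):

    use Lucy::Analysis::PolyAnalyzer;
    use Lucy::Analysis::StandardTokenizer;
    use Lucy::Analysis::RegexTokenizer;
    use Lucy::Analysis::Normalizer;

    # Step 1: the default -- UAX #29 word breaking.
    my $chain = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => [
            Lucy::Analysis::StandardTokenizer->new,
            Lucy::Analysis::Normalizer->new,
        ],
    );

    # Step 2: same chain shape, but splitting on \w+ instead.
    my $plain_chain = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => [
            Lucy::Analysis::RegexTokenizer->new( pattern => '\w+' ),
            Lucy::Analysis::Normalizer->new,
        ],
    );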

In other words, our focus should be on making it possible to "Extend
StandardTokenizer" (and potentially other Analyzers) arbitrarily.

If that goal seems too far away, then my next suggestion would be to create a
LucyX class to house a StandardTokenizer embellished with arbitrary extensions
-- working name: LucyX::Analysis::NonStandardTokenizer.
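
Until such a class exists, a minimal placeholder sketch -- the module name
comes from above, but everything else here is hypothetical -- could simply
hand back a RegexTokenizer configured for the \w+ behavior:

    package LucyX::Analysis::NonStandardTokenizer;
    use strict;
    use warnings;
    use Lucy::Analysis::RegexTokenizer;

    # Hypothetical stub: until real extension hooks exist, delegate the
    # "ignore MidLetter/MidNum" mode to a plain \w+ RegexTokenizer.
    sub new {
        my $class = shift;
        return Lucy::Analysis::RegexTokenizer->new( pattern => '\w+' );
    }

    1;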

Marvin Humphrey

