From Nick Wellnhofer <wellnho...@aevum.de>
Subject Re: [lucy-dev] Extending the StandardTokenizer
Date Fri, 24 Feb 2012 13:28:49 GMT
On 24/02/2012 00:27, Marvin Humphrey wrote:
> On Thu, Feb 23, 2012 at 01:56:33PM +0100, Nick Wellnhofer wrote:
> I'm -0 on adding a new non-extendable "WordTokenizer" class, though.
> WordTokenizer expands Lucy's public API for no other reason than performance
> in a not-very-common use case; that's not a good rationale for taking on the
> maintenance burden of a new public class, and it sets a bad precedent.  Next
> up will be WhiteSpaceTokenizer, and down the road we go... It will never end,
> because users have so many different tokenization requirements.

A second benefit over RegexTokenizer is better Unicode support, although 
I'm not really interested in that personally. I also don't think 
maintaining another tokenizer class would be much of a problem: I'd 
rather measure the maintenance burden in lines of code than in the 
number of public classes. Consequently, I'd be OK with a 
WhiteSpaceTokenizer and a couple of other tokenizers if their 
implementations are as trivial as the WordTokenizer I proposed.
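
For comparison, both behaviors are already one-liners with the stock 
RegexTokenizer (this is Lucy's public Perl API, not my patch); the 
dedicated classes would produce the same tokens with a hand-rolled scan 
instead of going through the regex engine:

    use Lucy::Analysis::RegexTokenizer;

    # Roughly the proposed WordTokenizer: each maximal run of word
    # characters (letters, digits, underscore) becomes a token.
    my $word_tokenizer = Lucy::Analysis::RegexTokenizer->new(
        pattern => '\w+',
    );

    # Roughly a WhiteSpaceTokenizer: each maximal run of
    # non-whitespace characters becomes a token.
    my $whitespace_tokenizer = Lucy::Analysis::RegexTokenizer->new(
        pattern => '\S+',
    );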

OTOH, I don't want to force a specialized tokenizer into the code base 
that no one besides me deems useful. I'd prefer to work on support for 
compiled extensions, so everyone can make small extensions to Lucy in C 
without seeking public consensus or maintaining their own patchsets.

> If you don't have time to work on an extension mechanism for
> StandardTokenizer, maybe I can help out.  I'm going to go study up on UAX #29
> and how you implemented StandardTokenizer and see if I can come up with
> any ideas.
>
> In the meantime, if you want to commit WordTokenizer, I won't object.  FWIW, I
> believe that the analogous Lucene class is called "LetterTokenizer", so you
> might consider renaming it.
>
>    http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/LetterTokenizer.html

I had a look at Lucene's tokenizers. LetterTokenizer seems to work with 
letters only, whereas my proposed WordTokenizer also handles numbers. 
I know the name may be a bit too general, but I couldn't come up with 
anything better.
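
To illustrate the difference in regex terms (just a sketch; the exact 
character classes my patch uses may differ slightly):

    use Lucy::Analysis::RegexTokenizer;

    # LetterTokenizer-like behavior: letters only, so "mp3"
    # tokenizes as just "mp" and the digit is discarded.
    my $letter_like = Lucy::Analysis::RegexTokenizer->new(
        pattern => '\p{L}+',
    );

    # WordTokenizer-like behavior: letters and numbers, so "mp3"
    # stays a single token.
    my $word_like = Lucy::Analysis::RegexTokenizer->new(
        pattern => '[\p{L}\p{N}]+',
    );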

> However, I hope we can manage to create an extension mechanism before the
> release of Lucy 0.4.0 that allows the user to code up the equivalent of
> WordTokenizer as a user-space subclass, and that you won't object to
> removing WordTokenizer before it escapes into the wild in that case.

Never mind. I decided against committing my patch in its current form.

Nick
