lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Implementing a tokenizer in core
Date Thu, 24 Nov 2011 21:41:21 GMT
On Wed, Nov 23, 2011 at 10:53:54PM +0100, Nick Wellnhofer wrote:
> On 23/11/11 03:50, Marvin Humphrey wrote:
>> How about making this tokenizer implement the word break rules described in
>> the Unicode standard annex on Text Segmentation?  That's what the Lucene
>> StandardTokenizer does (as of 3.1).
>
> That would certainly be a nice choice for the default tokenizer. It  
> would be easy to implement with ICU but utf8proc doesn't buy us much 
> here.

Hmm, that's unfortunate.  I think this would be a very nice feature to offer.
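
For what it's worth, here's roughly what UAX #29 word segmentation looks like
through ICU's BreakIterator -- sketched in Python via PyICU purely for
illustration, not a proposal for the actual binding:

    from icu import BreakIterator, Locale

    text = "maggie's farm can't hold me"
    bi = BreakIterator.createWordInstance(Locale("en_US"))
    bi.setText(text)

    # BreakIterator yields boundary offsets; slicing between successive
    # boundaries produces the segments, including whitespace runs, which
    # we filter out here.
    start = bi.first()
    tokens = []
    for end in bi:
        segment = text[start:end]
        if segment.strip():
            tokens.append(segment)
        start = end

    print(tokens)  # ["maggie's", 'farm', "can't", 'hold', 'me']

Note that the word break rules already keep apostrophes inside tokens, which
is relevant to the point below.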

>> I don't think we need to worry much about making this tokenizer flexible.  We
>> already offer a certain amount of flexibility via RegexTokenizer.
>
> Yes, making this tokenizer customizable probably isn't worth the effort.  
> I'd be happy with a simple tokenizer that extracts \w+ tokens. I can  
> offer to implement such a tokenizer if it's deemed useful.

A straight-up \w+ tokenizer wouldn't be optimal for English, at least.  It
would break on apostrophes, resulting in a large number of solitary 's' tokens
thanks to possessives and contractions -- e.g. "maggie's farm" would tokenize
as ["maggie", "s", "farm"] instead of ["maggie's", "farm"].

Marvin Humphrey

