lucy-dev mailing list archives

From Nick Wellnhofer <>
Subject Re: [lucy-dev] Implementing a tokenizer in core
Date Wed, 23 Nov 2011 21:53:54 GMT
On 23/11/11 03:50, Marvin Humphrey wrote:
> How about making this tokenizer implement the word break rules described in
> the Unicode standard annex on Text Segmentation?  That's what the Lucene
> StandardTokenizer does (as of 3.1).

That would certainly be a nice choice for the default tokenizer. It
would be easy to implement with ICU, but utf8proc doesn't buy us much here.

> I don't think we need to worry much about making this tokenizer flexible.  We
> already offer a certain amount of flexibility via RegexTokenizer.

Yes, making this tokenizer customizable probably isn't worth the effort.
I'd be happy with a simple tokenizer that extracts \w+ tokens, and I can
offer to implement one if it's deemed useful.

