lucy-dev mailing list archives

From Nick Wellnhofer <>
Subject Re: [lucy-dev] Implementing a tokenizer in core
Date Wed, 30 Nov 2011 15:40:00 GMT
On 24/11/2011 22:41, Marvin Humphrey wrote:
> On Wed, Nov 23, 2011 at 10:53:54PM +0100, Nick Wellnhofer wrote:
>> On 23/11/11 03:50, Marvin Humphrey wrote:
>>> How about making this tokenizer implement the word break rules described in
>>> the Unicode standard annex on Text Segmentation?  That's what the Lucene
>>> StandardTokenizer does (as of 3.1).
>> That would certainly be a nice choice for the default tokenizer. It
>> would be easy to implement with ICU but utf8proc doesn't buy us much
>> here.
> Hmm, that's unfortunate.  I think this would be a very nice feature to offer.

I had a closer look at the word boundary rules in UAX #29, and they 
shouldn't be too hard to implement without an external library. I've 
started on an initial prototype, and it looks very promising.

In order to look up the Word_Break property values, we have to precompute 
a few tables. I would write a Perl script for that. The tables can be 
generated once and shipped with the source code, much like the tables for 
utf8proc. I'm not sure where to put that script and the generated 
tables, though.

