incubator-lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Implementing a tokenizer in core
Date Wed, 23 Nov 2011 02:50:26 GMT
On Tue, Nov 22, 2011 at 11:10:59PM +0100, Nick Wellnhofer wrote:
> With the help of utf8proc we could implement a simple but more efficient
> tokenizer without external dependencies in core.

I like the idea.  It would be less flexible, but that's not a problem if we
continue to offer RegexTokenizer in addition to this one.

> Most importantly, we'd have to implement something similar to the \w regex
> character class.

Just a passing thought: I wonder if we could abuse the Lemon parser generator
for this.  It's a parser, not a lexer, but...

It would potentially be easier for sophisticated users to hack a grammar file
than a hand-coded lexer.

It would also be nice to use Lemon as much as we can, so that more people get
familiar with it and are thus able to maintain all the parts of Lucy that use it.
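
In the meantime, for concreteness, here's a rough sketch of the kind of loop
I imagine for a hand-rolled core tokenizer, assuming a recent utf8proc
(utf8proc_iterate() and utf8proc_category()).  The is_word_char() predicate is
my own rough cut at \w via Unicode general categories, not a definitive
implementation:

    /* tokenize.c -- compile with: cc tokenize.c -lutf8proc */
    #include <stdio.h>
    #include <string.h>
    #include <utf8proc.h>

    /* Rough, made-up approximation of \w via Unicode general
     * categories: letters, combining marks, numbers, underscore. */
    static int
    is_word_char(utf8proc_int32_t cp) {
        if (cp == '_') { return 1; }
        switch (utf8proc_category(cp)) {
            case UTF8PROC_CATEGORY_LU: case UTF8PROC_CATEGORY_LL:
            case UTF8PROC_CATEGORY_LT: case UTF8PROC_CATEGORY_LM:
            case UTF8PROC_CATEGORY_LO:  /* letters */
            case UTF8PROC_CATEGORY_MN: case UTF8PROC_CATEGORY_MC:
            case UTF8PROC_CATEGORY_ME:  /* combining marks */
            case UTF8PROC_CATEGORY_ND: case UTF8PROC_CATEGORY_NL:
            case UTF8PROC_CATEGORY_NO:  /* numbers */
                return 1;
            default:
                return 0;
        }
    }

    /* Print each run of word characters in a UTF-8 buffer. */
    static void
    tokenize(const char *buf, size_t len) {
        size_t pos = 0, start = 0;
        int in_token = 0;
        while (pos < len) {
            utf8proc_int32_t cp;
            utf8proc_ssize_t n = utf8proc_iterate(
                (const utf8proc_uint8_t*)buf + pos,
                (utf8proc_ssize_t)(len - pos), &cp);
            if (n <= 0) { break; }  /* malformed UTF-8: bail out */
            if (is_word_char(cp)) {
                if (!in_token) { start = pos; in_token = 1; }
            }
            else if (in_token) {
                printf("%.*s\n", (int)(pos - start), buf + start);
                in_token = 0;
            }
            pos += (size_t)n;
        }
        if (in_token) {
            printf("%.*s\n", (int)(pos - start), buf + start);
        }
    }

    int
    main(void) {
        const char *text = "Apache Lucy, c'est génial!";
        tokenize(text, strlen(text));
        return 0;
    }

Note that this splits "c'est" into "c" and "est", which brings us to the
apostrophe question.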

> The default regex in RegexTokenizer also handles apostrophes which I  
> don't find very useful personally. But this could also be implemented in  
> the core tokenizer.

How about making this tokenizer implement the word break rules described in
Unicode Standard Annex #29, "Unicode Text Segmentation"?  That's what Lucene's
StandardTokenizer does (as of Lucene 3.1).

    http://unicode.org/reports/tr29/
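
To make the apostrophe case concrete, here's a toy cut at just rules WB5
through WB7 from that annex, showing why "don't" comes through as a single
token.  It's an illustration, not a conforming implementation: classify() and
is_boundary() are hypothetical helpers, the classifier is ASCII-only, and I've
lumped the apostrophe in with MidLetter where the annex actually puts it in
MidNumLet:

    /* wordbreak.c -- toy subset of the UAX #29 word break rules */
    #include <stdio.h>
    #include <string.h>

    typedef enum { OTHER, ALETTER, MIDLETTER } WBClass;

    /* ASCII-only stand-in for the annex's Word_Break property. */
    static WBClass
    classify(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
            return ALETTER;
        }
        if (c == '\'' || c == ':') { return MIDLETTER; }
        return OTHER;
    }

    /* Is there a word boundary between text[i-1] and text[i]? */
    static int
    is_boundary(const char *text, size_t len, size_t i) {
        if (i == 0 || i == len) { return 1; }      /* WB1, WB2 */
        WBClass before = classify(text[i - 1]);
        WBClass after  = classify(text[i]);
        if (before == ALETTER && after == ALETTER) {
            return 0;                              /* WB5 */
        }
        if (before == ALETTER && after == MIDLETTER
            && i + 1 < len && classify(text[i + 1]) == ALETTER) {
            return 0;                              /* WB6 */
        }
        if (before == MIDLETTER && after == ALETTER
            && i >= 2 && classify(text[i - 2]) == ALETTER) {
            return 0;                              /* WB7 */
        }
        return 1;
    }

    int
    main(void) {
        const char *text = "don't stop";
        size_t len = strlen(text), start = 0;
        for (size_t i = 1; i <= len; i++) {
            if (!is_boundary(text, len, i)) { continue; }
            if (classify(text[start]) == ALETTER) {
                printf("word: %.*s\n", (int)(i - start), text + start);
            }
            start = i;
        }
        return 0;
    }

Under WB6/WB7 an apostrophe only survives when letters flank it on both
sides, so leading and trailing quotes fall away on their own.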

I don't think we need to worry much about making this tokenizer flexible.  We
already offer a certain amount of flexibility via RegexTokenizer.

Marvin Humphrey

