lucy-dev mailing list archives

From Nick Wellnhofer <wellnho...@aevum.de>
Subject [lucy-dev] Implementing a tokenizer in core
Date Tue, 22 Nov 2011 22:10:59 GMT
Currently, Lucy only provides the RegexTokenizer, which is implemented on
top of the Perl regex engine. With the help of utf8proc, we could
implement a simple but more efficient tokenizer in core, with no external
dependencies. Most importantly, we'd have to implement something similar
to the \w regex character class. The Unicode standard [1,2] recommends
that \w be equivalent to [\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}],
that is, the Unicode categories Letter, Mark, Decimal_Number,
Letter_Number, and Connector_Punctuation, plus the circled letters.
That's exactly how Perl implements \w. Other implementations like .NET's
seem to differ slightly [3]. So we could look up Unicode categories with
utf8proc, and a Perl-compatible check for a word character would then be
as easy as (cat <= 10 || cat == 12 || (c >= 0x24b6 && c <= 0x24e9)).
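
For illustration, a minimal sketch of that check in C might look like
the following. It assumes utf8proc's utf8proc_get_property(), and the
helper name S_is_word_char is invented for the example; utf8proc numbers
the general categories so that Lu=1 .. Lo=5, Mn=6 .. Me=8, Nd=9, Nl=10,
and Pc=12.

    #include <stdbool.h>
    #include <stdint.h>
    #include "utf8proc.h"

    /* Hypothetical helper: true if `c` belongs to the Perl-compatible
     * \w class described above. */
    static bool
    S_is_word_char(int32_t c) {
        const utf8proc_property_t *prop = utf8proc_get_property(c);
        int cat = prop->category;
        return (cat >= 1 && cat <= 10)           /* Letter, Mark, Nd, Nl  */
            || cat == 12                         /* Connector_Punctuation */
            || (c >= 0x24B6 && c <= 0x24E9);     /* circled letters       */
    }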

The default regex in RegexTokenizer also handles apostrophes, which I
personally don't find very useful. But this, too, could be implemented
in the core tokenizer.
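
If we did want apostrophe handling, the scanning loop might look
something like this sketch, which builds on S_is_word_char above. It
uses utf8proc_iterate() to decode one code point at a time; the
token_cb_t callback and the function name are invented for the example.

    #include <stdint.h>
    #include <stddef.h>
    #include <sys/types.h>      /* ssize_t */
    #include "utf8proc.h"

    /* Invented callback: receives each token as pointer and length. */
    typedef void (*token_cb_t)(const char *token, size_t len);

    static void
    S_tokenize(const char *text, size_t len, token_cb_t emit) {
        const uint8_t *buf   = (const uint8_t*)text;
        ssize_t        total = (ssize_t)len;
        ssize_t        pos   = 0;

        while (pos < total) {
            int32_t cp;
            ssize_t n = utf8proc_iterate(buf + pos, total - pos, &cp);
            if (n <= 0) { pos++; continue; }     /* skip malformed bytes */
            if (!S_is_word_char(cp)) { pos += n; continue; }

            /* Start of a token: consume word chars, plus any apostrophe
             * immediately followed by another word char. */
            ssize_t start = pos;
            ssize_t end   = pos + n;
            pos += n;
            while (pos < total) {
                n = utf8proc_iterate(buf + pos, total - pos, &cp);
                if (n <= 0) break;
                if (S_is_word_char(cp)) {
                    pos += n;
                    end  = pos;
                }
                else if (cp == 0x27) {           /* apostrophe: look ahead */
                    int32_t next;
                    ssize_t m = utf8proc_iterate(buf + pos + n,
                                                 total - pos - n, &next);
                    if (m <= 0 || !S_is_word_char(next)) break;
                    pos += n + m;
                    end  = pos;
                }
                else break;
            }
            emit((const char*)(buf + start), (size_t)(end - start));
        }
    }

With this approach, "don't" comes through as a single token while a
trailing apostrophe (as in "cats'") is dropped. Note the sketch only
recognizes U+0027; a curly apostrophe (U+2019) would need the same
treatment if we wanted to match typographic text.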

I'm wondering what other kinds of regexes people are using with
RegexTokenizer, and whether this simple core tokenizer should be made
customizable for some of those use cases.

Nick

[1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[2] http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
[3] http://msdn.microsoft.com/en-us/library/20bw873z.aspx#WordCharacter
