incubator-lucy-dev mailing list archives

From Dan Markham <dmark...@gmail.com>
Subject Re: [lucy-dev] Implementing a tokenizer in core
Date Wed, 23 Nov 2011 01:09:49 GMT
A quick grep of my code base turns up these:

'[^¡]+'                    --- crazy unicode char to be unique
'[^\x{1}]+'                --- another crazy unique char
'\S+'                      --- we use this a lot to not get hit by strings with hyphens in them.
'\w+(?:[\'\x{2019}]\w+)*'  --- the default
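
For illustration, the four patterns can be tried outside Lucy. The sketch below is a Python approximation, not RegexTokenizer itself. It assumes Python's re module as a stand-in for the Perl regex engine, rewrites the \x{...} escapes in Python syntax, and assumes the delimiter in the first pattern is U+00A1. PATTERNS and tokenize are hypothetical names, and Python's \w is close to Perl's but not guaranteed identical.

```python
import re

# Python translations of the Perl patterns listed above (assumption: Python's
# re module stands in for the Perl regex engine; \x{2019} becomes \u2019).
PATTERNS = {
    "split_on_u00a1": r"[^\u00a1]+",        # assumed delimiter: U+00A1
    "split_on_u0001": r"[^\x01]+",          # delimiter: U+0001
    "non_whitespace": r"\S+",               # keeps hyphenated strings whole
    "default": r"\w+(?:['\u2019]\w+)*",     # words with inner apostrophes
}


def tokenize(pattern_name, text):
    """Return every non-overlapping match of the named pattern as a token."""
    return re.findall(PATTERNS[pattern_name], text)


if __name__ == "__main__":
    sample = "it's a well-known fact"
    for name in ("default", "non_whitespace"):
        print(name, tokenize(name, sample))
```

On the sample text, the default pattern splits "well-known" into two tokens, while '\S+' keeps it as one, which matches the reason given for using '\S+'.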



-Dan




On Nov 22, 2011, at 2:10 PM, Nick Wellnhofer wrote:

> Currently, Lucy only provides the RegexTokenizer, which is implemented on top
> of the perl regex engine. With the help of utf8proc we could implement a
> simple but more efficient tokenizer without external dependencies in core.
> Most important, we'd have to implement something similar to the \w regex
> character class. The Unicode standard [1,2] recommends that \w is equivalent
> to [\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}], that is, Unicode categories
> Letter, Mark, Decimal_Number, Letter_Number, and Connector_Punctuation plus
> circled letters. That's exactly how perl implements \w. Other implementations
> like .NET seem to differ slightly [3]. So we could look up Unicode categories
> with utf8proc, and then a perl-compatible check for a word character would be
> as easy as (cat <= 10 || cat == 12 || c >= 0x24b6 && c <= 0x24e9).
> 
> The default regex in RegexTokenizer also handles apostrophes which I don't
> find very useful personally. But this could also be implemented in the core
> tokenizer.
> 
> I'm wondering what other kind of regexes people are using with
> RegexTokenizer, and whether this simple core tokenizer should be customizable
> for some of these use cases.
> 
> Nick
> 
> [1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
> [2] http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
> [3] http://msdn.microsoft.com/en-us/library/20bw873z.aspx#WordCharacter
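
The word-character check described in the quoted message can be sketched as follows. This is a Python illustration only, not the proposed core tokenizer: it uses the standard unicodedata module in place of utf8proc and compares category names instead of utf8proc's numeric category values. WORD_CATEGORIES, is_word_char and tokenize_words are hypothetical names, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# General categories named in the message: Letter (L*), Mark (M*),
# Decimal_Number (Nd), Letter_Number (Nl) and Connector_Punctuation (Pc).
WORD_CATEGORIES = {
    "Lu", "Ll", "Lt", "Lm", "Lo",
    "Mn", "Mc", "Me",
    "Nd", "Nl",
    "Pc",
}


def is_word_char(ch):
    """True if ch is in one of the listed categories or is a circled letter."""
    return (
        unicodedata.category(ch) in WORD_CATEGORIES
        or 0x24B6 <= ord(ch) <= 0x24E9  # circled Latin letters
    )


def tokenize_words(text):
    """Split text into maximal runs of word characters, like matching \\w+."""
    tokens = []
    start = None  # index where the current token began, or None
    for i, ch in enumerate(text):
        if is_word_char(ch):
            if start is None:
                start = i
        elif start is not None:
            tokens.append(text[start:i])
            start = None
    if start is not None:
        tokens.append(text[start:])
    return tokens


if __name__ == "__main__":
    print(tokenize_words("it's a well-known fact_1"))
```

Because the sketch has no apostrophe handling, "it's" comes out as two tokens, which is the behaviour the default RegexTokenizer pattern avoids.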

