lucy-dev mailing list archives

From "Andrew S. Townley"
Subject Re: [lucy-dev] RegexTokenizer
Date Tue, 08 Mar 2011 17:50:34 GMT

On 8 Mar 2011, at 5:36 PM, Marvin Humphrey wrote:

> Greets,
>
> Right now, Lucy only has one tokenizer-style Analyzer subclass:
> Lucy::Analysis::Tokenizer, which is regex based.
>
> At some point, I expect we will have other tokenizer classes which don't use a
> regex engine, so I think it would be best to reserve the name "Tokenizer" for
> future use and rename the current Tokenizer to "RegexTokenizer".
>
> Another possibility would be "PerlRegexTokenizer", embedding the regex dialect
> that will be used to interpret the supplied pattern in the class name.
>
> However, the exact behavior of the regular expression engine is not consistent
> across different versions of Perl.  In general, it's not going to be possible
> to translate a pattern between different regex engines.  If we try to specify
> the regex dialect precisely so that the tokenization behavior is fully defined
> by the serialized analyzer within the schema file, the only remedy on mismatch
> will be to throw an exception and refuse to read the index.
>
> Therefore, I think we should just have a single class named "RegexTokenizer"
> which is defined as deferring to the host language's regex engine.  Managing
> portability across different host languages or different versions of the host
> language will be left to the user.
>
> Marvin Humphrey

Sounds like a reasonable approach.  Tokenizer for the interface and RegexTokenizer for platform-specific
regexes (which, in fairness, is kinda what people would expect anyway).
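
To make that split concrete, here is a rough sketch in Python rather than Perl or C.
It is not Lucy code: the class layout, the tokenize() method name and the default
pattern are all assumptions made up for illustration.  The only point is that the
pattern string is stored as-is and interpreted by whatever regex engine the host
language provides.

```python
import re
from abc import ABC, abstractmethod
from typing import List


class Tokenizer(ABC):
    """Hypothetical interface: anything that splits text into tokens."""

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        ...


class RegexTokenizer(Tokenizer):
    """Hypothetical regex-based tokenizer that defers to the host engine.

    Here the host engine is Python's `re` module; the same pattern string
    might behave differently under another language's engine.
    """

    def __init__(self, pattern: str = r"\w+(?:'\w+)*"):
        # Keep the raw pattern string: this is the part that would be
        # serialized, with no regex dialect recorded alongside it.
        self.pattern = pattern
        self._compiled = re.compile(pattern)

    def tokenize(self, text: str) -> List[str]:
        # Each full match of the pattern becomes one token.
        return [m.group(0) for m in self._compiled.finditer(text)]


if __name__ == "__main__":
    print(RegexTokenizer().tokenize("it's a test"))
```

A different pattern, or a different host language's engine, may split the same text
differently, which is the portability burden the proposal leaves to the user.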

Many things support Perl5 regexes to varying degrees, so you'd likely not have too much trouble
from a portability perspective.  If you wanted to lock it in across host languages, then you
could always implement this in C using the library of your choice due to the architecture.

Andrew S. Townley
