lucy-dev mailing list archives

From Marvin Humphrey
Subject [lucy-dev] RegexTokenizer
Date Tue, 08 Mar 2011 17:36:03 GMT

Right now, Lucy has only one tokenizer-style Analyzer subclass:
Lucy::Analysis::Tokenizer, which is regex-based.

At some point, I expect we will have other tokenizer classes which don't use a
regex engine, so I think it would be best to reserve the name "Tokenizer" for
future use and rename the current Tokenizer to "RegexTokenizer".

Another possibility would be "PerlRegexTokenizer", which would embed in the
class name the regex dialect used to interpret the supplied pattern.
However, the exact behavior of the regular expression engine is not consistent
across different versions of Perl.  In general, it's not going to be possible
to translate a pattern between different regex engines.  If we try to specify
the regex dialect precisely so that the tokenization behavior is fully defined
by the serialized analyzer within the schema file, the only remedy on mismatch
will be to throw an exception and refuse to read the index.

Therefore, I think we should just have a single class named "RegexTokenizer"
which is defined as deferring to the host language's regex engine.  Managing
portability across different host languages or different versions of the host
language will be left to the user.
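
To illustrate what "deferring to the host language's regex engine" might look
like, here is a minimal sketch in Python rather than Perl.  It is not Lucy's
actual implementation: the class layout, the method names (tokenize, dump,
load), the "_class" key, and the default pattern are all assumptions made for
illustration.  The point is that only the pattern string is serialized, and
whichever regex engine the host provides interprets it.

```python
import re
from typing import Dict, List


class RegexTokenizer:
    """Hypothetical sketch: a tokenizer that defers to the host regex engine."""

    # Assumed default pattern: runs of word characters, allowing
    # internal apostrophes (e.g. "It's").
    DEFAULT_PATTERN = r"\w+(?:'\w+)*"

    def __init__(self, pattern: str = DEFAULT_PATTERN) -> None:
        self.pattern = pattern
        # The host language (here Python's `re`) compiles the pattern;
        # another host could interpret the same string differently.
        self._regex = re.compile(pattern)

    def tokenize(self, text: str) -> List[str]:
        """Return every non-overlapping match of the pattern, in order."""
        return [match.group(0) for match in self._regex.finditer(text)]

    def dump(self) -> Dict[str, str]:
        """Serialize only the class name and pattern string, not the dialect."""
        return {"_class": "RegexTokenizer", "pattern": self.pattern}

    @classmethod
    def load(cls, data: Dict[str, str]) -> "RegexTokenizer":
        """Rebuild a tokenizer; the pattern is recompiled by the current host."""
        return cls(data["pattern"])


if __name__ == "__main__":
    tokenizer = RegexTokenizer()
    print(tokenizer.tokenize("It's a regex-based tokenizer."))
    restored = RegexTokenizer.load(tokenizer.dump())
    print(restored.tokenize("It's a regex-based tokenizer."))
```

Because the serialized form records no dialect, nothing in this sketch detects
a pattern that behaves differently under another engine or version; that
burden stays with the user, as described above.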

Marvin Humphrey
