lucy-dev mailing list archives

From Nathan Kurz <>
Subject Re: [lucy-dev] Implementing a tokenizer in core
Date Fri, 25 Nov 2011 22:35:02 GMT
On Tue, Nov 22, 2011 at 6:50 PM, Marvin Humphrey <> wrote:
> I don't think we need to worry much about making this tokenizer flexible.  We
> already offer a certain amount of flexibility via RegexTokenizer.

I agree with this.  I think the number of people who need an
extremely efficient tokenizer that is also extremely flexible is low.
Keep RegexTokenizer as the flexible option, and write this alternative
for greater performance.  Rather than making it completely
configurable, put the emphasis on making it clear, simple, and
independent of the inner workings of Lucy.  Maybe put it in LucyX
(API dogfood), and let it serve as an example for anyone who wants to
write their own.

My tokenizing needs are theoretical at this point, but the areas I
care about involve tokenizing whitespace, capitalization, and markup.
I'd like to discourage a quoted search for "Proper Name" from matching
"is that proper?<br>\nName your price," and I think the easiest way to
do this is by indexing some things that would normally be ignored.  I
also care about punctuation, such as Marvin's "Maggie's Farm"
apostrophe example, as well as things like "hyphenated-compound",
"C++", and "U.S.A.".
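To make the punctuation cases concrete, here's a rough sketch (in
Python, not Lucy's API) of the kind of token pattern I have in mind --
one that keeps apostrophe forms, hyphenated compounds, "C++", and
dotted acronyms as single tokens.  The pattern and names are my own
illustration, not anything Lucy ships:

```python
import re

# Hypothetical token pattern, for illustration only:
#   - dotted acronyms like "U.S.A." (first alternative, tried first)
#   - plain words, apostrophe forms like "Maggie's",
#     hyphenated compounds, and trailing plus signs as in "C++"
TOKEN = re.compile(r"""
    [A-Za-z]\.(?:[A-Za-z]\.)+     # dotted acronyms: U.S.A.
  | \w+(?:[-']\w+)*\+*            # words, Maggie's, hyphenated-compound, C++
""", re.VERBOSE)

def tokenize(text):
    """Return the tokens matched by TOKEN, skipping everything else."""
    return TOKEN.findall(text)

print(tokenize("Maggie's Farm, a hyphenated-compound, C++, U.S.A."))
# → ["Maggie's", 'Farm', 'a', 'hyphenated-compound', 'C++', 'U.S.A.']
```

A real tokenizer would also have to decide what to emit for the markup
and whitespace cases above, but this shows the punctuation side.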
