lucy-dev mailing list archives

From Nathan Kurz <>
Subject Re: [lucy-dev] RegexTokenizer
Date Tue, 08 Mar 2011 19:36:43 GMT
On Tue, Mar 8, 2011 at 9:36 AM, Marvin Humphrey <> wrote:
> Therefore, I think we should just have a single class named "RegexTokenizer"
> which is defined as deferring to the host language's regex engine.  Managing
> portability across different host languages or different versions of the host
> language will be left to the user.

Maybe I'm misunderstanding, but I'd suggest thinking very carefully
before doing this.

I think one of the strengths of Lucy's host-core split is that the
core remains language agnostic.  Once each index becomes specific to
a host language, wouldn't you lose the ability to create the index
in one language and access it from another?  While there is some
advantage to having all the tokenizing be host native, I think there
is greater value in being able to create the index with a good text
processing language (Perl in my case) while being able to perform the
searches from a compiled language (likely C).

I'd suggest instead that RegexTokenizer be host-independent and use
something like PCRE.  While this might make for a few odd corner
cases, I think it will work better in multilingual projects.   Make it
easy to switch to a different tokenizer, but provide something built
in that can be used standalone.  But maybe this is a philosophical
rather than practical problem:  do you view the (future) C API as
distinct from Lucy Core?  If one wanted to wrap the core up to act as
a freestanding HTTP or 0mq server, what would the "host language" be?

>  If we try to specify
> the regex dialect precisely so that the tokenization behavior is fully defined
> by the serialized analyzer within the schema file, the only remedy on mismatch
> will be to throw an exception and refuse to read the index.

I'm not getting this.  Is there a failure mode other than not finding
the token you search for?  I think I can envision cases where you might
consciously want two different tokenizers working on the same index:
stemming in one and not the other, or maybe even indexing bi-grams as a
means of boosting ad hoc phrase queries.
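
To make the "not finding the token" case concrete, here is a minimal
sketch in Python.  It is not Lucy code: the toy inverted index and the
helper names (tokenize, build_index, search) are hypothetical, and
Python's re module with and without re.ASCII merely stands in for "the
same pattern run under two different regex dialects".

```python
import re
from collections import defaultdict

# Stand-ins for two regex dialects: by default Python's \w matches
# Unicode word characters; with re.ASCII it matches only [a-zA-Z0-9_].
UNICODE_WORD = re.compile(r"\w+")
ASCII_WORD = re.compile(r"\w+", re.ASCII)


def tokenize(pattern, text):
    """Return the lowercased matches of pattern in text."""
    return [m.group(0).lower() for m in pattern.finditer(text)]


def build_index(pattern, docs):
    """Build a toy inverted index: token -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(pattern, text):
            index[token].add(doc_id)
    return index


def search(index, pattern, query):
    """AND together the postings of every token in the query."""
    hits = None
    for token in tokenize(pattern, query):
        postings = index.get(token, set())
        hits = postings if hits is None else hits & postings
    return hits or set()


docs = {1: "na\u00efve tokenizer", 2: "plain tokenizer"}

# Index with one dialect...
index = build_index(UNICODE_WORD, docs)

# ...then query with the same dialect and with the other one.
same = search(index, UNICODE_WORD, "na\u00efve")
mismatched = search(index, ASCII_WORD, "na\u00efve")
```

In this sketch the mismatched query splits the word into "na" and "ve",
neither of which is in the index, so the search quietly returns nothing
rather than raising an error.  Queries made only of ASCII words behave
identically under both patterns.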

Nathan Kurz
