Date: Tue, 8 Mar 2011 14:02:00 -0800
From: Marvin Humphrey <marvin@rectangular.com>
To: lucy-dev@incubator.apache.org
Message-ID: <20110308220200.GA22239@rectangular.com>
References: <20110308173603.GA21683@rectangular.com>
 <AANLkTikkivRHT1GJBjec3DAGAG-eZ=P+Q1kyWTfQ=b-o@mail.gmail.com>
In-Reply-To: <AANLkTikkivRHT1GJBjec3DAGAG-eZ=P+Q1kyWTfQ=b-o@mail.gmail.com>
User-Agent: Mutt/1.5.18 (2008-05-17)
Subject: Re: [lucy-dev] RegexTokenizer

On Tue, Mar 08, 2011 at 11:36:43AM -0800, Nathan Kurz wrote:
> Once each index becomes specific to each host language, wouldn't you lose
> the ability to create the index in one language and access it from another?   

Indexes are specific to the host language right now, since Tokenizer uses
Perl's regex engine and CaseFolder uses Perl's lowercasing (which is imperfect
in its implementation of the Unicode case-folding algorithm).  I'm not
personally planning to work on that prior to 0.1.0.
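
To illustrate why lowercasing and Unicode case folding are not
interchangeable, here is a minimal Python sketch.  It does not reproduce
Perl's behavior or Lucy's CaseFolder; Python's str.lower() and
str.casefold() merely stand in for "host lowercasing" and "full case
folding", and the helper names are hypothetical.

```python
# Illustration only: str.lower() stands in for host lowercasing and
# str.casefold() for Unicode case folding.  Neither is Lucy's CaseFolder.

def normalize_lower(text):
    """Normalize by simple lowercasing."""
    return text.lower()

def normalize_fold(text):
    """Normalize by full Unicode case folding."""
    return text.casefold()

indexed_term = "Straße"   # term as it appeared in the indexed document
query_term = "STRASSE"    # term as typed in the query

# Lowercasing keeps the sharp s, so the two forms never meet.
lower_match = normalize_lower(indexed_term) == normalize_lower(query_term)

# Case folding maps the sharp s to "ss", so the two forms agree.
fold_match = normalize_fold(indexed_term) == normalize_fold(query_term)

print(lower_match, fold_match)  # False True
```

An index built with one normalization and searched with the other would
miss this term without reporting any error.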

> While there is some advantage to having all the tokenizing be host native, I
> think there is greater value in being able to create the index with a
> good text processing language (Perl in my case) while being able to perform
> the searches from a compiled language (likely C).

I agree that such cross-host-language flexibility would be a nice option.
I also think it's important that we not lard up core with mandatory
dependencies.  Rather than add PCRE, I'd prefer to focus on extracting
Snowball!  A C application should be able to link in only the Lucy modules it
needs.

> I'd suggest instead that RegexTokenizer be host-independent and use
> something like PCRE.  While this might make for a few odd corner cases, I
> think it will work better in multilingual projects.   

Well, so long as a "PCRETokenizer" is available as a module, those who require
cross-host-language compatibility can get what they need.  So the main
question is whether we should *stop* providing an analyzer which uses the host
regex engine.

I'd actually prefer to pull *all* of the Analyzers out of core.  That's what
Lucene has done, with Robert Muir doing most of the work to put everything
into a "modules" directory.  

But that's a larger discussion and more than I want to take on prior to 0.1.0.
Right now, reserving the name "Tokenizer" is my priority.

> do you view the (future) C API as distinct from Lucy Core?

That's the way the design looks at the moment.  Not all the functions declared
in the header files within trunk/core/ have bodies defined within trunk/core/
-- some of the implementations are within trunk/perl/xs/ and we would need
analogous implementations within trunk/c/.

The design isn't set in stone, though.  The port to C isn't finished, and I
expect that we'll need to make adjustments as we add other bindings.

> >  If we try to specify the regex dialect precisely so that the tokenization
> >  behavior is fully defined by the serialized analyzer within the schema
> >  file, the only remedy on mismatch will be to throw an exception and
> >  refuse to read the index.
> 
> I'm not getting this.  Is there a failure other than not finding the token
> you search for?

I'm guessing that there are regexes which are legal in one host but syntax
errors in another... but silent failure to match is indeed my main concern.
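
A rough sketch of the silent-failure case, in Python.  This is an
assumption-laden stand-in: it does not compare two real host engines, it
just uses Python's re.ASCII flag to simulate an engine whose \w knows only
ASCII next to one whose \w is Unicode-aware.  The tokenize() helper is
hypothetical.

```python
import re

def tokenize(pattern, text, flags=0):
    """Return every non-overlapping match of pattern as a token."""
    return re.findall(pattern, text, flags)

text = "naïve café"

# "Engine A": \w matches Unicode word characters (Python's default for str).
index_tokens = tokenize(r"\w+", text)

# "Engine B": the same pattern, but \w is restricted to [a-zA-Z0-9_].
search_tokens = tokenize(r"\w+", text, re.ASCII)

print(index_tokens)   # ['naïve', 'café']
print(search_tokens)  # ['na', 've', 'caf']

# The pattern compiles under both interpretations, so nothing is raised;
# a term produced by one simply never appears among the other's tokens.
print("café" in index_tokens, "café" in search_tokens)  # True False
```

The same pattern string yields two different token streams, and nothing
warns you about it.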

If we specify that "PerlRegexTokenizer" has the behavior of the regex engine
in Perl 5.10.1, what happens when we load Lucy in Perl 5.12.2 or 5.8.9?
Should we attempt to translate and provide full feature-compatibility and
bug-compatibility?  No way that would work.
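
For the sake of argument, this is roughly what the strict alternative would
look like: record the dialect in the serialized analyzer and refuse to open
the index on mismatch.  Everything in this Python sketch is hypothetical,
including the "regex_dialect" key, the JSON layout, and the function and
exception names; none of it is Lucy's actual schema format or API.

```python
import json

class DialectMismatch(Exception):
    """Raised when the host regex dialect differs from the recorded one."""

def check_analyzer(serialized_schema, host_dialect):
    """Return the analyzer spec, or raise if its dialect does not match.

    The schema layout and the "regex_dialect" key are assumptions made
    for this sketch only.
    """
    schema = json.loads(serialized_schema)
    analyzer = schema["analyzer"]
    recorded = analyzer["regex_dialect"]
    if recorded != host_dialect:
        raise DialectMismatch(
            "index built with %s, host provides %s" % (recorded, host_dialect)
        )
    return analyzer

schema_json = json.dumps({
    "analyzer": {
        "class": "PerlRegexTokenizer",
        "pattern": r"\w+",
        "regex_dialect": "perl-5.10.1",
    }
})

# Same dialect: the analyzer spec is returned.
analyzer = check_analyzer(schema_json, "perl-5.10.1")

# Different dialect: the only available remedy is to refuse the index.
try:
    check_analyzer(schema_json, "perl-5.12.2")
    refused = False
except DialectMismatch:
    refused = True

print(analyzer["class"], refused)
```

An exact-match check like this would lock users out of their indexes on
every host upgrade, which is the cost of defining the dialect precisely.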

This same problem affects Java Lucene when you change your JVM and the new one
has a different version of Unicode.

Marvin Humphrey