Date: Tue, 8 Mar 2011 14:02:00 -0800
From: Marvin Humphrey
To: lucy-dev@incubator.apache.org
Message-ID: <20110308220200.GA22239@rectangular.com>
References: <20110308173603.GA21683@rectangular.com>
Subject: Re: [lucy-dev] RegexTokenizer

On Tue, Mar 08, 2011 at 11:36:43AM -0800, Nathan Kurz wrote:
> Once each index becomes specific to each host language, wouldn't you lose
> the ability to create the index in one language
> and access it from another?

Indexes are specific to the host language right now, since Tokenizer uses
Perl's regex engine and CaseFolder uses Perl's lowercasing (which is imperfect
in its implementation of the Unicode case-folding algorithm). I'm not
personally planning to work on that prior to 0.1.0.

> While there is some advantage to having all the tokenizing be host native, I
> think there is greater value in being able to create the index with a
> good text processing language (Perl in my case) while being able to perform
> the searches from a compiled language (likely C).

I agree that such cross-host-language flexibility would be a nice option. I
also think it's important that we not lard up core with mandatory
dependencies. Rather than add PCRE, I'd prefer to focus on extracting
Snowball! A C application should be able to link in only the Lucy modules it
needs.

> I'd suggest instead that RegexTokenizer be host-independent and use
> something like PCRE. While this might make for a few odd corner cases, I
> think it will work better in multilingual projects.

Well, so long as a "PCRETokenizer" is available as a module, those who require
cross-host-language compatibility can get what they need. So the main
question is whether we should *stop* providing an analyzer which uses the host
regex engine.

I'd actually prefer to pull *all* of the Analyzers out of core. That's what
Lucene has done, with Robert Muir doing most of the work to put everything
into a "modules" directory. But that's a larger discussion and more than I
want to take on prior to 0.1.0. Right now, reserving the name "Tokenizer" is
my priority.

> do you view the (future) C API as distinct from Lucy Core?

That's the way the design looks at the moment. Not all the functions declared
in the header files within trunk/core/ have bodies defined within trunk/core/
-- some of the implementations are within trunk/perl/xs/ and we would need
analogous implementations within trunk/c/.
The design isn't set in stone, though. The port to C isn't finished, and I
expect that we'll need to make adjustments as we add other bindings.

> > If we try to specify the regex dialect precisely so that the tokenization
> > behavior is fully defined by the serialized analyzer within the schema
> > file, the only remedy on mismatch will be to throw an exception and
> > refuse to read the index.
>
> I'm not getting this. Is there a failure other than not finding the token
> you search for?

I'm guessing that there are regexes which are legal in one host but syntax
errors in another... but silent failure to match is indeed my main concern.

If we specify that "PerlRegexTokenizer" has the behavior of the regex engine
in Perl 5.10.1, what happens when we load Lucy in Perl 5.12.2 or 5.8.9? Should
we attempt to translate and provide full feature-compatibility and
bug-compatibility? No way that would work.

This same problem affects Java Lucene when you change your JVM and the new one
has a different version of Unicode.

Marvin Humphrey
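
Illustration (not Lucy code): the "silent failure to match" concern above can
be sketched in a few lines of Python. Everything here is hypothetical: the two
tokenizers merely stand in for two host regex engines that disagree about what
`\w` matches, and `check_engine` is an assumed guard showing the "throw an
exception and refuse to read the index" remedy, not an existing Lucy API.

```python
import re

def tokenize_host_a(text):
    # Stand-in for a host whose \w is Unicode-aware (Python's default for str).
    return re.findall(r"\w+", text.lower())

def tokenize_host_b(text):
    # Stand-in for a host whose \w only matches ASCII word characters.
    return re.findall(r"\w+", text.lower(), flags=re.ASCII)

def build_index(docs, tokenize):
    # Minimal inverted index: token -> set of document ids.
    index = {}
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, query, tokenize):
    # A document matches only if it contains every query token.
    hits = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*hits) if hits else set()

def check_engine(serialized_analyzer, host_engine):
    # Hypothetical guard: refuse to use an index whose analyzer was
    # serialized under a different regex engine identifier.
    if serialized_analyzer["engine"] != host_engine:
        raise ValueError(
            "index built with %s, host provides %s"
            % (serialized_analyzer["engine"], host_engine)
        )

docs = ["A na\u00efve caf\u00e9 owner", "plain ascii text"]
index = build_index(docs, tokenize_host_a)

same_host = search(index, "caf\u00e9", tokenize_host_a)   # finds document 0
other_host = search(index, "caf\u00e9", tokenize_host_b)  # empty set, no error
```

The second search raises nothing; it simply looks up "caf" instead of the
indexed term and returns no hits. A guard along the lines of `check_engine`
trades that silent miss for a hard failure, which is the trade-off under
discussion in this thread.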