From: Marvin Humphrey
To: lucy-dev@incubator.apache.org
Date: Mon, 20 Feb 2012 20:46:48 -0800
Subject: Re: [lucy-dev] Extending the StandardTokenizer

On Mon, Feb 20, 2012 at 08:07:00PM +0100, Nick Wellnhofer
wrote:
>> To address the immediate concern, is it an option to just use
>> RegexTokenizer with a \w+ pattern? RegexTokenizer's primary utility is
>> that it solves many, many use cases while posing a minimal ongoing
>> maintenance burden.
>
> A plain \w+ pattern would work for me. I'm mainly interested in the
> performance benefits of StandardTokenizer.

Thanks for being straightforward about that.

IMO, Lucene has an API and an implementation which are vastly larger than
they need to be, because the code base has accreted a zillion
micro-optimization hooks over the years. I believe that ultimately, Lucy
will be able to catch and pass Lucene for speed, in part because Lucene's
development is hampered by the back-compatibility burdens it has taken on
by accepting so many nickel-and-dime speed tweaks.

However, I could be wrong about Lucy's potential, because it may be that
undisciplined expansion is an inescapable consequence of the Apache
consensus-driven development model -- in which case Lucy development will
ultimately slow down for the same reasons. :) It's going to be
interesting to find out.

> Actually, you can formulate the complete UAX#29 word breaking rules as a
> Perl regex which is even quite readable. But performance would probably
> suffer even more, because you'd have to use Perl's \p{} construct to
> look up word break properties.

Haha, that's neat to think about!

> One solution I've been thinking about is to make StandardTokenizer work
> with arbitrary word break property tables. That is, use the rules
> described in UAX#29 but allow for customized mappings of the word break
> property, which should cover many use cases.

So it would be like specifying a Perl regex where you are only allowed to
use \p{} constructs and a very limited set of properties.

> This would basically mean porting the code in devel/bin/UnicodeTable.pm
> to C and providing a nice public interface. It's certainly feasible, but
> there are some challenges involved -- serialization, for example.
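As an aside, the two regex approaches discussed earlier in the thread can
be sketched in plain, Lucy-free Perl. Everything below is illustrative
(the subroutine names are invented, and the \p{} version covers only two
Word_Break property values, nowhere near the full UAX#29 rule set):

```perl
use strict;
use warnings;

# Approach 1: the simple pattern you'd hand to RegexTokenizer.
sub tokenize_simple {
    my ($text) = @_;
    return [ $text =~ /\w+/g ];
}

# Approach 2, sketched: match runs of characters by their Unicode
# Word_Break property via \p{}.  Every character costs a property
# lookup, which is where the performance hit would come from.
sub tokenize_by_wb_property {
    my ($text) = @_;
    return [ $text =~ /[\p{Word_Break=ALetter}\p{Word_Break=Numeric}]+/g ];
}
```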
The serialization problem is solvable via subclassing. Initialize the
data via a callback subroutine which the user must override in a
subclass. That way, the class name of the user subclass stands in as a
symbol for all of its methods. All you need in the Schema file is the
name of the subclass, and you're able to initialize the object completely
so long as the class has been loaded.

>> If that goal seems too far away, then my next suggestion would be to
>> create a LucyX class to house a StandardTokenizer embellished with
>> arbitrary extensions -- working name:
>> LucyX::Analysis::NonStandardTokenizer.
>
> That would be OK with me.

OK, then how about this? Create LucyX::Analysis::NonStandardTokenizer
with a callback which handles assembling the specific Unicode properties.
It may take a couple of iterations to get the interface solid, but that's
OK because LucyX classes come with lower expectations for backwards
compat. If and when we decide that we've gotten the callback
initialization API right, we can move the method up into
StandardTokenizer and make NonStandardTokenizer a trivial subclass.

For what it's worth, IMO you should feel free to mess with
StandardTokenizer's internals while hacking up an implementation for
NonStandardTokenizer. Everything's reversible so long as you don't change
StandardTokenizer's interface, and the way I'm thinking you'd implement
this, that seems like the easiest way.

> On another note, is it possible to package Lucy extensions that contain
> C code outside of the main source tree?

There are three things preventing that right now.

1) We have not published a public C API.
2) We don't install the Lucy C headers.
3) We need to work out support for systems which are strict about symbol
   exports, e.g. Cygwin, MSVC, theoretically AIX, etc.

That said, the intent is absolutely that Lucy should support compiled
extensions (see LUCY-5). It's just that we haven't completed this
functionality because nobody has needed it badly enough.
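Coming back to the serialization-via-subclassing idea above, here is a
minimal Lucy-free sketch of the trick. Every class and method name here
is invented for illustration; none of it is real Lucy API:

```perl
use strict;
use warnings;

package BaseTokenizer;
sub new {
    my ($class) = @_;
    my $self = bless {}, $class;
    # Initialize data via a callback which subclasses override.
    $self->{overrides} = $self->gather_overrides;
    return $self;
}
sub gather_overrides { return {} }
# "Serialize" by recording nothing but the class name...
sub dump { my ($self) = @_; return { class => ref $self } }
# ...and "deserialize" by re-running the constructor, which re-invokes
# the callback.  No property data ever needs to hit the Schema file,
# so long as the subclass has been loaded.
sub load { my ( undef, $dump ) = @_; return $dump->{class}->new }

package MyTokenizer;
our @ISA = ('BaseTokenizer');
# Hypothetical override: treat curly apostrophe as MidLetter.
sub gather_overrides { return { "\x{2019}" => 'MidLetter' } }

package main;
my $tok     = MyTokenizer->new;
my $revived = BaseTokenizer->load( $tok->dump );
```

The round trip restores a fully initialized MyTokenizer from just the
class name, which is the whole point.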
Marvin Humphrey