From: Marvin Humphrey
To: lucy-dev@incubator.apache.org
Date: Mon, 20 Feb 2012 20:46:48 -0800
Subject: Re: [lucy-dev] Extending the StandardTokenizer

On Mon, Feb 20, 2012 at 08:07:00PM +0100, Nick Wellnhofer
wrote:
>> To address the immediate concern, is it an option to just use
>> RegexTokenizer with a \w+ pattern? RegexTokenizer's primary utility is
>> that it solves many, many use cases while posing a minimal ongoing
>> maintenance burden.
>
> A plain \w+ pattern would work for me. I'm mainly interested in the
> performance benefits of StandardTokenizer.

Thanks for being straightforward about that.

IMO, Lucene has an API and an implementation which are vastly larger than
they need to be, because the code base has accreted a zillion
micro-optimization hooks over the years. I believe that ultimately, Lucy
will be able to catch and pass Lucene for speed, in part because Lucene's
development is hampered by the back-compatibility burdens it has taken on
by accepting so many nickel-and-dime speed tweaks.

However, I could be wrong about Lucy's potential, because it may be that
undisciplined expansion is an inescapable consequence of the Apache
consensus-driven development model -- in which case Lucy development will
ultimately slow down for the same reasons. :) It's going to be
interesting to find out.

> Actually, you can formulate the complete UAX#29 word breaking rules as a
> Perl regex which is even quite readable. But performance would probably
> suffer even more, because you'd have to use Perl's \p{} construct to
> look up word break properties.

Haha, that's neat to think about!

> One solution I've been thinking about is to make StandardTokenizer work
> with arbitrary word break property tables. That is, use the rules
> described in UAX#29 but allow for customized mappings of the word break
> property, which should cover many use cases.

So it would be like specifying a Perl regex where you are only allowed to
use \p{} constructs and a very limited set of properties.

> This would basically mean porting the code in devel/bin/UnicodeTable.pm
> to C and providing a nice public interface. It's certainly feasible, but
> there are some challenges involved -- serialization, for example.
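As an aside, the two regex approaches discussed earlier in the thread can
be sketched in plain, Lucy-free Perl. Everything below is illustrative
(the subroutine names are invented, and the \p{} version covers only two
Word_Break property values, nowhere near the full UAX#29 rule set):

```perl
use strict;
use warnings;

# Approach 1: the simple pattern you'd hand to RegexTokenizer.
sub tokenize_simple {
    my ($text) = @_;
    return [ $text =~ /\w+/g ];
}

# Approach 2, sketched: match runs of characters by their Unicode
# Word_Break property via \p{}.  Every character costs a property
# lookup, which is where the performance hit would come from.
sub tokenize_by_wb_property {
    my ($text) = @_;
    return [ $text =~ /[\p{Word_Break=ALetter}\p{Word_Break=Numeric}]+/g ];
}
```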
The serialization problem is solvable via subclassing. Initialize the
data via a callback subroutine which the user must override in a
subclass. That way, the class name of the user subclass stands in as a
symbol for all of its methods. All you need in the Schema file is the
name of the subclass, and you're able to initialize the object completely
so long as the class has been loaded.

>> If that goal seems too far away, then my next suggestion would be to
>> create a LucyX class to house a StandardTokenizer embellished with
>> arbitrary extensions -- working name:
>> LucyX::Analysis::NonStandardTokenizer.
>
> That would be OK with me.

OK, then how about this? Create LucyX::Analysis::NonStandardTokenizer
with a callback which handles assembling the specific Unicode properties.
It may take a couple of iterations to get the interface solid, but that's
OK because LucyX classes come with lower expectations for backwards
compat. If and when we decide that we've gotten the callback
initialization API right, we can move the method up into
StandardTokenizer and make NonStandardTokenizer a trivial subclass.

For what it's worth, IMO you should feel free to mess with
StandardTokenizer's internals while hacking up an implementation for
NonStandardTokenizer. Everything's reversible so long as you don't change
StandardTokenizer's interface, and the way I'm thinking you'd implement
this, that seems like the easiest way.

> On another note, is it possible to package Lucy extensions that contain
> C code outside of the main source tree?

There are three things preventing that right now.

1) We have not published a public C API.
2) We don't install the Lucy C headers.
3) We need to work out support for systems which are strict about symbol
   exports, e.g. Cygwin, MSVC, theoretically AIX, etc.

That said, the intent is absolutely that Lucy should support compiled
extensions (see LUCY-5). It's just that we haven't completed this
functionality because nobody has needed it badly enough.
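Coming back to the serialization-via-subclassing idea above, here is a
minimal Lucy-free sketch of the trick. Every class and method name here
is invented for illustration; none of it is real Lucy API:

```perl
use strict;
use warnings;

package BaseTokenizer;
sub new {
    my ($class) = @_;
    my $self = bless {}, $class;
    # Initialize data via a callback which subclasses override.
    $self->{overrides} = $self->gather_overrides;
    return $self;
}
sub gather_overrides { return {} }
# "Serialize" by recording nothing but the class name...
sub dump { my ($self) = @_; return { class => ref $self } }
# ...and "deserialize" by re-running the constructor, which re-invokes
# the callback.  No property data ever needs to hit the Schema file,
# so long as the subclass has been loaded.
sub load { my ( undef, $dump ) = @_; return $dump->{class}->new }

package MyTokenizer;
our @ISA = ('BaseTokenizer');
# Hypothetical override: treat curly apostrophe as MidLetter.
sub gather_overrides { return { "\x{2019}" => 'MidLetter' } }

package main;
my $tok     = MyTokenizer->new;
my $revived = BaseTokenizer->load( $tok->dump );
```

The round trip restores a fully initialized MyTokenizer from just the
class name, which is the whole point.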
Marvin Humphrey