incubator-lucy-dev mailing list archives

From Nick Wellnhofer <wellnho...@aevum.de>
Subject Re: [lucy-dev] Extending the StandardTokenizer
Date Wed, 22 Feb 2012 11:32:02 GMT
On 21/02/2012 05:46, Marvin Humphrey wrote:
> On Mon, Feb 20, 2012 at 08:07:00PM +0100, Nick Wellnhofer wrote:
>> One solution I've been thinking about is to make StandardTokenizer work
>> with arbitrary word break property tables. That is, use the rules
>> described in UAX#29 but allow for customized mappings of the word break
>> property which should cover many use cases.
>
> So it would be like specifying a Perl regex where you are only allowed to use
> \p{} constructs and a very limited set of properties.

What I had in mind is even more basic. I would keep the current 
implementation of StandardTokenizer and simply use different tables to 
look up the word break property.
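
A minimal sketch of the idea in C, assuming a toy ASCII-only table and a much simplified tokenizer (a token is any maximal run of code points whose property is not "Other", which is not the full UAX#29 rule set). All names here (WordBreakProp, WBLookup, default_lookup, hyphen_lookup, count_tokens) are hypothetical and not part of Lucy's API. The point is only that the tokenizing loop stays the same while the lookup table is swapped:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced set of word break property values. */
typedef enum {
    WB_OTHER = 0,
    WB_ALETTER,
    WB_NUMERIC
} WordBreakProp;

/* A lookup function maps a code point to its word break property. */
typedef WordBreakProp (*WBLookup)(uint32_t code_point);

/* Toy default table covering ASCII letters and digits only. */
static WordBreakProp
default_lookup(uint32_t cp) {
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) {
        return WB_ALETTER;
    }
    if (cp >= '0' && cp <= '9') {
        return WB_NUMERIC;
    }
    return WB_OTHER;
}

/* Customized mapping: treat '-' as a letter, otherwise defer to the
 * default table. */
static WordBreakProp
hyphen_lookup(uint32_t cp) {
    if (cp == '-') {
        return WB_ALETTER;
    }
    return default_lookup(cp);
}

/* Count tokens in an ASCII string. The loop never changes; only the
 * lookup passed in decides what belongs to a word. */
static size_t
count_tokens(const char *text, WBLookup lookup) {
    size_t count    = 0;
    int    in_token = 0;
    const unsigned char *p;
    for (p = (const unsigned char *)text; *p != '\0'; p++) {
        int is_word = lookup(*p) != WB_OTHER;
        if (is_word && !in_token) {
            count++;
        }
        in_token = is_word;
    }
    return count;
}
```

With these toy tables, "send e-mail now" splits at the hyphen under default_lookup but keeps "e-mail" together under hyphen_lookup.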

>> This would basically mean porting the code in devel/bin/UnicodeTable.pm to
>> C and providing a nice public interface. It's certainly feasible but there
>> are some challenges involved, serialization for example.
>
> The serialization problem is solvable via subclassing.
>
> Initialize the data via a callback subroutine which the user must override in
> a subclass.  That way, the class name of the user subclass stands in as a
> symbol for all of its methods.  All you need in the Schema file is the name of
> the subclass and you're able to initialize the object completely so long as
> the class has been loaded.

Seems like a good idea.
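
As I understand the proposal, it could be sketched in C roughly as follows. This is an illustration under assumptions, not Lucy's actual object model: Tokenizer, TokenizerClass, loaded_classes, tokenizer_dump and tokenizer_load are hypothetical names, "MyApp::HyphenTokenizer" is a made-up user subclass, and a single flag stands in for a whole property table. Only the class name is stored; loading finds the class by name and runs its initialization callback:

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for a tokenizer whose tables are filled in by a callback. */
typedef struct {
    int hyphen_is_letter;  /* placeholder for a full property table */
} Tokenizer;

/* The callback that a "subclass" overrides to set up its tables. */
typedef void (*InitTablesFn)(Tokenizer *self);

typedef struct {
    const char  *class_name;
    InitTablesFn init_tables;
} TokenizerClass;

static void
standard_init(Tokenizer *self) {
    self->hyphen_is_letter = 0;
}

static void
hyphen_init(Tokenizer *self) {
    self->hyphen_is_letter = 1;
}

/* Classes that have been "loaded" into the process. */
static const TokenizerClass loaded_classes[] = {
    { "StandardTokenizer",      standard_init },
    { "MyApp::HyphenTokenizer", hyphen_init   }
};

/* Serialize: the class name is all that needs to be written out. */
static const char*
tokenizer_dump(const TokenizerClass *klass) {
    return klass->class_name;
}

/* Deserialize: look the class up by name and run its callback.
 * Returns 0 on success, -1 if the class has not been loaded. */
static int
tokenizer_load(const char *class_name, Tokenizer *out) {
    size_t n = sizeof(loaded_classes) / sizeof(loaded_classes[0]);
    size_t i;
    for (i = 0; i < n; i++) {
        if (strcmp(loaded_classes[i].class_name, class_name) == 0) {
            loaded_classes[i].init_tables(out);
            return 0;
        }
    }
    return -1;
}
```

The table data itself never has to be serialized, at the cost that loading fails when the named class is not available.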

>>> If that goal seems too far away, then my next suggestion would be to create a
>>> LucyX class to house a StandardTokenizer embellished with arbitrary extensions
>>> -- working name: LucyX::Analysis::NonStandardTokenizer.
>>
>> That would be OK with me.
>
> OK, then how about this?
>
> Create LucyX::Analysis::NonStandardTokenizer with a callback which handles
> assembling the specific unicode properties.  It may take a couple iterations
> to get the interface solid, but that's OK because LucyX classes come with
> lower expectations for backwards compat.
>
> If and when we decide that we've gotten the callback initialization API
> right, we can move the method up into StandardTokenizer and make
> NonStandardTokenizer a trivial subclass.
>
> For what it's worth, IMO you should feel free to mess with StandardTokenizer's
> internals while hacking up an implementation for NonStandardTokenizer.
> Everything's reversible so long as you don't change StandardTokenizer's
> interface, and the way I'm thinking you'd implement this, that seems like the
> easiest way.

Implementing completely arbitrary tables would be too much work for me 
at the moment. I'd prefer to write a custom tokenizer based on 
StandardTokenizer under the LucyX::Analysis namespace.

Nick
