Reply-To: lucy-dev@incubator.apache.org
Message-ID: <4ECC55E2.9040901@peknet.com>
Date: Tue, 22 Nov 2011 20:09:38 -0600
From: Peter Karman <peter@peknet.com>
Reply-To: peter@peknet.com
To: lucy-dev@incubator.apache.org
References: <4ECC1DF3.7020602@aevum.de>
In-Reply-To: <4ECC1DF3.7020602@aevum.de>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [lucy-dev] Implementing a tokenizer in core

Nick Wellnhofer wrote on 11/22/11 4:10 PM:
> Currently, Lucy only provides the RegexTokenizer which is implemented on top of
> the perl regex engine. With the help of utf8proc we could implement a simple but
> more efficient tokenizer without external dependencies in core. Most important,
> we'd have to implement something similar to the \w regex character class. The
> Unicode standard [1,2] recommends that \w is equivalent to
> [\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}], that is Unicode categories Letter,
> Mark, Decimal_Number, Letter_Number, and Connector_Punctuation plus circled
> letters. That's exactly how perl implements \w. Other implementations like .NET
> seem to differ slightly [3]. So we could look up Unicode categories with utf8proc
> and then a perl-compatible check for a word character would be as easy as (cat
> <= 10 || cat == 12 || c >= 0x24b6 && c <= 0x24e9).
> 
> The default regex in RegexTokenizer also handles apostrophes which I don't find
> very useful personally. But this could also be implemented in the core tokenizer.
> 
> I'm wondering what other kind of regexes people are using with RegexTokenizer,
> and whether this simple core tokenizer should be customizable for some of these
> use cases.

When I use Lucy, I stick with the default regex, mostly because I know my
collections are en_US. AFAIK, a language- or locale-aware tokenizer would need
to discriminate "word" boundaries, for which \w might be too blunt an instrument.

I agree that a core tokenizer would be a Good Thing.
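
For concreteness, the word-character check Nick describes could be sketched
in C roughly as follows. This is only an illustration, not Lucy code: the
CAT_* constants are my assumption about how utf8proc numbers the general
categories, and is_word_char(), count_tokens() and ascii_category() are
hypothetical names. A real implementation would get the category from
utf8proc instead of the ASCII-only stand-in used here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: general-category numbers as utf8proc is believed to define
 * them (Cn = 0, Lu = 1, ..., Nd = 9, Nl = 10, No = 11, Pc = 12, Pd = 13). */
enum {
    CAT_CN = 0,
    CAT_LU = 1,
    CAT_LL = 2,
    CAT_ND = 9,
    CAT_NL = 10,
    CAT_NO = 11,
    CAT_PC = 12,
    CAT_PD = 13
};

/* The check quoted above: Letter, Mark, Decimal_Number, Letter_Number,
 * Connector_Punctuation, plus the circled letters U+24B6..U+24E9.
 * Note that "cat <= 10" also accepts category 0 under this numbering. */
static bool is_word_char(int cat, int32_t c) {
    return cat <= CAT_NL || cat == CAT_PC || (c >= 0x24b6 && c <= 0x24e9);
}

/* Hypothetical callback type: maps a code point to its category number. */
typedef int (*category_fn)(int32_t c);

/* Count maximal runs of word characters in an array of code points. */
static size_t count_tokens(const int32_t *cps, size_t len, category_fn category) {
    size_t count = 0;
    bool in_word = false;
    for (size_t i = 0; i < len; i++) {
        bool w = is_word_char(category(cps[i]), cps[i]);
        if (w && !in_word) {
            count++;            /* a new token starts here */
        }
        in_word = w;
    }
    return count;
}

/* ASCII-only stand-in for a real category lookup. Everything that is not a
 * letter, digit or underscore is lumped into CAT_PD, which is not accurate
 * but is enough to exercise the loop. */
static int ascii_category(int32_t c) {
    if (c >= 'A' && c <= 'Z') return CAT_LU;
    if (c >= 'a' && c <= 'z') return CAT_LL;
    if (c >= '0' && c <= '9') return CAT_ND;
    if (c == '_') return CAT_PC;
    return CAT_PD;
}
```

With this sketch, "it's ok" splits into three tokens ("it", "s", "ok"),
because nothing here treats the apostrophe specially the way the default
RegexTokenizer pattern does.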


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com