Message-ID: <4ECC55E2.9040901@peknet.com>
Date: Tue, 22 Nov 2011 20:09:38 -0600
From: Peter Karman
Reply-To: peter@peknet.com
To: lucy-dev@incubator.apache.org
References: <4ECC1DF3.7020602@aevum.de>
In-Reply-To: <4ECC1DF3.7020602@aevum.de>
Subject: Re: [lucy-dev] Implementing a tokenizer in core

Nick Wellnhofer wrote on 11/22/11 4:10 PM:
> Currently, Lucy only provides the RegexTokenizer which is implemented on top of
> the perl regex engine. With the help of utf8proc we could implement a simple but
> more efficient tokenizer without external dependencies in core. Most important,
> we'd have to implement something similar to the \w regex character class. The
> Unicode standard [1,2] recommends that \w is equivalent to
> [\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}], that is Unicode categories Letter,
> Mark, Decimal_Number, Letter_Number, and Connector_Punctuation plus circled
> letters. That's exactly how perl implements \w. Other implementations like .NET
> seem to differ slightly [3]. So we could lookup Unicode categories with utf8proc
> and then a perl-compatible check for a word character would be as easy as
> (cat <= 10 || cat == 12 || c >= 0x24b6 && c <= 0x24e9).
>
> The default regex in RegexTokenizer also handles apostrophes which I don't find
> very useful personally. But this could also be implemented in the core tokenizer.
>
> I'm wondering what other kind of regexes people are using with RegexTokenizer,
> and whether this simple core tokenizer should be customizable for some of these
> use cases.

When I use Lucy I use the default regex. That's mostly because I know my
collections are en_US. AFAIK, a language|locale-aware tokenizer would need to
discriminate "word" boundaries, for which \w might be too blunt an instrument.

I agree that a core tokenizer would be a Good Thing.

--
Peter Karman  .  http://peknet.com/  .  peter@peknet.com
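
For illustration, the word-character check described in the quoted message can be sketched in a few lines. This is only a rough sketch, not Lucy code: it uses Python's standard `unicodedata` module in place of utf8proc, and the names `WORD_CATEGORIES`, `is_word_char`, and `tokenize` are made up for the example. It assumes the category list given in the quote (Letter, Mark, Decimal_Number, Letter_Number, Connector_Punctuation, plus the circled letters U+24B6 to U+24E9).

```python
import unicodedata

# General categories that the quoted message lists for \w:
# Letter (L*), Mark (M*), Decimal_Number (Nd), Letter_Number (Nl),
# Connector_Punctuation (Pc). Names here are hypothetical, for illustration.
WORD_CATEGORIES = {
    "Lu", "Ll", "Lt", "Lm", "Lo",  # Letter
    "Mn", "Mc", "Me",              # Mark
    "Nd", "Nl",                    # Decimal_Number, Letter_Number
    "Pc",                          # Connector_Punctuation
}


def is_word_char(ch):
    """Return True if ch is a word character under the quoted definition."""
    if unicodedata.category(ch) in WORD_CATEGORIES:
        return True
    # Circled Latin letters are not in the categories above,
    # so the range is checked explicitly.
    return 0x24B6 <= ord(ch) <= 0x24E9


def tokenize(text):
    """Return (start, end, token) tuples for runs of word characters."""
    tokens = []
    start = None
    for i, ch in enumerate(text):
        if is_word_char(ch):
            if start is None:
                start = i  # a new token begins here
        elif start is not None:
            tokens.append((start, i, text[start:i]))
            start = None
    if start is not None:
        # flush a token that runs to the end of the input
        tokens.append((start, len(text), text[start:]))
    return tokens
```

With this sketch, `tokenize("it's")` yields two tokens, `it` and `s`, because the apostrophe is not a word character. That is the behavior the default RegexTokenizer pattern avoids by handling apostrophes, and it says nothing about locale-aware word boundaries.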