From: Nathan Kurz
Date: Fri, 25 Nov 2011 14:35:02 -0800
To: lucy-dev@incubator.apache.org
Subject: Re: [lucy-dev] Implementing a tokenizer in core

On Tue, Nov 22, 2011 at 6:50 PM, Marvin Humphrey wrote:
> I don't think we need to worry much about making this tokenizer flexible. We
> already offer a certain amount of flexibility via RegexTokenizer.

I agree with this. I think the number of people who need a tokenizer that is both extremely efficient and extremely flexible is small. Keep RegexTokenizer as the flexible option, and write this alternative for greater performance. Rather than making it completely configurable, put the emphasis on making it clear, simple, and independent of the inner workings of Lucy. Maybe put it in LucyX (API dogfood), and let it serve as an example for anyone who wants to write their own.

My tokenizing needs are theoretical at this point, but the areas I care about involve tokenizing whitespace, capitalization, and markup. I'd like to discourage a quoted search for "Proper Name" from matching "is that proper?
\nName your price," and I think the easiest way to do this is by indexing some things that would normally be ignored. I also care about punctuation such as Marvin's "Maggie's Farm" apostrophe example, as well as things like "hyphenated-compound", "C++", and "U.S.A.".

--nate
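
To make that concrete, here is a rough sketch (plain C, nothing Lucy-specific, and all names hypothetical) of the kind of simple standalone tokenizer I have in mind: split on whitespace and punctuation, but keep word-internal apostrophes and hyphens, trailing '+' signs, and dotted acronyms.

    /*
     * Minimal sketch of a simple, self-contained tokenizer: not Lucy code,
     * just an illustration of the punctuation cases mentioned above
     * ("Maggie's", "hyphenated-compound", "C++", "U.S.A.").
     */
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Return 1 if the byte at text[i] belongs inside a token. */
    static int
    is_token_char(const char *text, size_t i, size_t len)
    {
        unsigned char c = (unsigned char)text[i];
        if (isalnum(c)) { return 1; }
        /* Keep apostrophes and hyphens only between alphanumerics. */
        if ((c == '\'' || c == '-')
            && i > 0 && i + 1 < len
            && isalnum((unsigned char)text[i - 1])
            && isalnum((unsigned char)text[i + 1])) {
            return 1;
        }
        /* Keep '+' when it follows an alphanumeric or another '+' ("C++"). */
        if (c == '+' && i > 0
            && (isalnum((unsigned char)text[i - 1]) || text[i - 1] == '+')) {
            return 1;
        }
        /* Keep '.' after a lone letter so "U.S.A." survives, while a
         * sentence-final period after a full word is dropped. */
        if (c == '.' && i > 0 && isalpha((unsigned char)text[i - 1])
            && (i < 2 || !isalnum((unsigned char)text[i - 2]))) {
            return 1;
        }
        return 0;
    }

    int
    main(void)
    {
        const char *text =
            "Maggie's Farm, C++ and the U.S.A. -- a hyphenated-compound.";
        size_t len = strlen(text);
        size_t i = 0;
        while (i < len) {
            /* Skip separators, then scan one token and print it. */
            while (i < len && !is_token_char(text, i, len)) { i++; }
            size_t start = i;
            while (i < len && is_token_char(text, i, len)) { i++; }
            if (i > start) {
                printf("%.*s\n", (int)(i - start), text + start);
            }
        }
        return 0;
    }

Something this small stays readable as an example, and anyone who needs different punctuation rules can copy it and change one predicate rather than wading through configuration options.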