From: Nick Wellnhofer
To: lucy-dev@incubator.apache.org
Date: Mon, 20 Feb 2012 13:52:26 +0100
Message-ID: <4F42420A.6070907@aevum.de>
Subject: [lucy-dev] Extending the StandardTokenizer

Currently, the new StandardTokenizer implements the word break algorithm defined in Unicode Standard Annex #29. One detail of this algorithm is that it defines sets of "MidLetter" and "MidNum" characters that don't break a sequence of letters or numbers. The main reason seems to be to avoid breaking around characters like apostrophes or number separators.

While some people might prefer this behavior, I'd like to add a second mode of operation that splits on all characters that are not alphanumeric, with the exception of underscores. This would closely resemble a RegexTokenizer with a \w+ pattern.

The whole thing could be implemented by simply adding an option to StandardTokenizer so that the "MidLetter" and "MidNum" rules are ignored and those characters break words like any other punctuation.

Nick
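
To illustrate the difference between the two modes, here is a rough Python sketch. It is not Lucy code and not a full UAX #29 implementation: the MID_LETTER and MID_NUM sets are small hand-picked stand-ins for the real Unicode property tables, and the function name and the split_on_mid flag are hypothetical names chosen for this example only.

```python
import re

# Hand-picked, simplified stand-ins for the UAX #29 classes (assumption:
# the real property tables are larger and differ between Unicode versions).
MID_LETTER = set("'\u2019.:\u00b7")
MID_NUM = set("'\u2019.,;")


def is_word_char(ch):
    """Alphanumeric or underscore, roughly what \\w matches."""
    return ch.isalnum() or ch == "_"


def tokenize(text, split_on_mid=False):
    """Split text into word tokens.

    With split_on_mid=False, a "mid" character is kept inside a token when
    it sits directly between two letters (MID_LETTER) or two digits
    (MID_NUM). With split_on_mid=True, every character that is not
    alphanumeric or an underscore ends the current token.
    """
    tokens, current = [], []
    for i, ch in enumerate(text):
        if is_word_char(ch):
            current.append(ch)
            continue
        if not split_on_mid and current and i + 1 < len(text):
            prev, nxt = text[i - 1], text[i + 1]
            joins_letters = ch in MID_LETTER and prev.isalpha() and nxt.isalpha()
            joins_digits = ch in MID_NUM and prev.isdigit() and nxt.isdigit()
            if joins_letters or joins_digits:
                current.append(ch)
                continue
        # Any other character ends the current token, if there is one.
        if current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens
```

For the input "Nick's index has 1,299.50 docs in foo_bar", the default mode keeps "Nick's" and "1,299.50" as single tokens, while split_on_mid=True yields "Nick", "s", "1", "299", "50" and so on, which is the same result as re.findall(r"\w+", text). In both modes "foo_bar" stays intact because the underscore is treated as a word character.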