Message-ID: <4F42420A.6070907@aevum.de>
Date: Mon, 20 Feb 2012 13:52:26 +0100
From: Nick Wellnhofer <wellnhofer@aevum.de>
To: lucy-dev@incubator.apache.org
Subject: [lucy-dev] Extending the StandardTokenizer


Currently, the new StandardTokenizer implements the word break algorithm
as defined in Unicode Standard Annex #29. One detail of this algorithm is
that it defines sets of "MidLetter" and "MidNum" characters which don't
break a sequence of letters or numbers, respectively. The main reason
seems to be to avoid breaking at characters like apostrophes or number
separators.

While some people might prefer this behavior, I'd like to add a second
mode of operation that splits on every character that is neither
alphanumeric nor an underscore. This would closely resemble a
RegexTokenizer with a \w+ pattern.

The whole thing could be implemented by simply adding an option to
StandardTokenizer so that "MidLetter" and "MidNum" characters lose their
special treatment and break tokens like any other punctuation.
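
To illustrate the difference between the two modes, here is a rough sketch in Python. It is not Lucy code and not the full UAX #29 algorithm: the function name tokenize, the keep_mid flag, and the two tiny joiner sets are made up for this example, and the real "MidLetter"/"MidNum" sets come from Unicode property data and are larger.

```python
import re

# Hypothetical, heavily reduced stand-ins for the Unicode property classes.
LETTER_JOINERS = "'."   # may sit between two letters without breaking
NUMBER_JOINERS = ".,"   # may sit between two digits without breaking


def is_word(c):
    """True for alphanumeric characters and the underscore."""
    return c.isalnum() or c == "_"


def joins(prev, mid, nxt):
    """True if `mid` should not break the token between `prev` and `nxt`."""
    if mid in LETTER_JOINERS and prev.isalpha() and nxt.isalpha():
        return True
    if mid in NUMBER_JOINERS and prev.isdigit() and nxt.isdigit():
        return True
    return False


def tokenize(text, keep_mid=True):
    """Split text into tokens; keep_mid=False disables the joiner rules."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if not is_word(text[i]):
            i += 1
            continue
        start = i
        i += 1
        while i < n:
            c = text[i]
            if is_word(c):
                i += 1
            elif keep_mid and i + 1 < n and joins(text[i - 1], c, text[i + 1]):
                i += 2  # consume the joiner and the character after it
            else:
                break
        tokens.append(text[start:i])
    return tokens


sample = "can't pay 3.14 for foo_bar"
default_mode = tokenize(sample, keep_mid=True)
proposed_mode = tokenize(sample, keep_mid=False)
regex_mode = re.findall(r"\w+", sample)
```

With keep_mid=True the sample keeps "can't" and "3.14" as single tokens. With keep_mid=False they are split into "can", "t", "3" and "14", which for this input is the same result as the \w+ pattern, while "foo_bar" stays intact in both modes.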

Nick