lucene-general mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: Tokenizer Question
Date Mon, 05 Jan 2009 17:35:13 GMT
Hi ayyanar,

On 01/05/2009 at 12:23 PM, ayyanar wrote:
> I need a tokenizer that tokenizes a keyword as follows. Consider the
> example "President day": this should be tokenized as "President day",
> "President", "Day". This seems to combine the functionality of a keyword
> tokenizer and a whitespace tokenizer. Is there a tokenizer that does
> this job, or do we need to write a custom one?

A ShingleFilter <http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/shingle/ShingleFilter.html>
over a whitespace tokenizer should do the trick. By default, it outputs unigrams
(individual terms) in addition to shingles (token n-grams).
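
To make the expected output concrete, here is a rough plain-Java sketch of the token set such a chain would produce. It does not call Lucene at all: the class name ShingleSketch and the method shingles are made up for illustration, and the emission order (each unigram followed by the shingles that start at it) is an assumption about what is useful to show, not a statement of ShingleFilter's exact behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is NOT Lucene's ShingleFilter API.
// ShingleSketch and shingles() are hypothetical names used for this sketch.
class ShingleSketch {
    // Splits the text on whitespace, then for each token emits the token
    // itself (the unigram) followed by every shingle of 2..maxSize tokens
    // that starts at that token.
    static List<String> shingles(String text, int maxSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out; // nothing to tokenize
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            StringBuilder current = new StringBuilder(tokens[i]);
            out.add(current.toString()); // unigram
            for (int n = 2; n <= maxSize && i + n <= tokens.length; n++) {
                current.append(' ').append(tokens[i + n - 1]);
                out.add(current.toString()); // shingle of n tokens
            }
        }
        return out;
    }
}
```

Calling ShingleSketch.shingles("President day", 2) gives "President", "President day" and "day". Case is preserved, so getting "Day" or a lowercased form would need an extra filter in the chain.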

Steve

