lucene-java-user mailing list archives

From Xiyang Chen
Subject Tokenize a dictionary of phrases
Date Sun, 21 Aug 2011 19:53:19 GMT

I have a dictionary of multi-word phrases, and I'd like to analyze documents so that any phrase
that appears in the dictionary is treated as a single token.
For example, if the dictionary contains "brown fox", then the sentence
The quick brown fox jumps over the lazy dog.

should be tokenized as (with stopwords stripped):
quick | brown fox | jumps | lazy | dog
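
The desired behavior can be sketched in plain Java, outside Lucene's analysis API. This is only an illustration under stated assumptions: the class PhraseMerger and its tokenize method are hypothetical names, the stopword set is made up for the example, text is lowercased and split on non-alphanumeric characters, and phrases are matched greedily, longest first, before stopwords are removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene.
final class PhraseMerger {

    // Returns tokens with dictionary phrases merged into single tokens
    // and stopwords dropped. Phrases are expected in lowercase,
    // with words separated by a single space.
    static List<String> tokenize(String text, Set<String> phrases, Set<String> stopwords) {
        // Longest phrase length in words, so the lookahead is bounded.
        int maxLen = 1;
        for (String p : phrases) {
            maxLen = Math.max(maxLen, p.split(" ").length);
        }

        // Assumed tokenization: lowercase, split on runs of non-alphanumerics.
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }

        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < words.size()) {
            int matched = 0;
            // Try the longest candidate first (greedy longest match).
            for (int len = Math.min(maxLen, words.size() - i); len >= 2; len--) {
                String candidate = String.join(" ", words.subList(i, i + len));
                if (phrases.contains(candidate)) {
                    out.add(candidate);
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                i += matched;
                continue;
            }
            // No phrase starts here: keep the word unless it is a stopword.
            if (!stopwords.contains(words.get(i))) {
                out.add(words.get(i));
            }
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize(
                "The quick brown fox jumps over the lazy dog.",
                Set.of("brown fox"),
                Set.of("the", "over")); // assumed stopword list
        System.out.println(String.join(" | ", tokens));
        // prints: quick | brown fox | jumps | lazy | dog
    }
}
```

Matching phrases before stripping stopwords means a dictionary phrase that itself contains a stopword would still be recognized; doing it in the other order would be a different design choice.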

What is the best way to achieve this?
