lucene-java-user mailing list archives

From "Ray Tsang" <>
Subject Re: The best Chinese Analyzer?
Date Mon, 08 May 2006 09:28:43 GMT
Hi Bob,

In short, I use a slightly modified ChineseAnalyzer to index Chinese text.
The three analyzers differ mainly in the way they tokenize the text.

StandardAnalyzer is intended for use with Latin-based languages, where
each word is made up of multiple characters and words are separated by
markers such as a space ' ', a comma, a period, a new line, etc. So
"C1C2C3" (space) "C4C5C6" will be tokenized into 2 terms:
"C1C2C3" and "C4C5C6"

CJKAnalyzer tokenizes Chinese text into 2-grams (overlapping pairs of
characters):
"C1C2C3C4" -> "C1C2" "C2C3" "C3C4"

ChineseAnalyzer tokenizes Chinese text into 1-grams (single characters):
"C1C2C3C4" -> "C1" "C2" "C3" "C4"

The most obvious consequence of these three tokenization strategies
shows up in the search results. With the examples above, if you search
for "C2C3" you can only find it with ChineseAnalyzer, not with the
other two.
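
Whichever analyzer you choose, make sure you use the same one at query
time as at index time, otherwise the query terms will not line up with
the indexed terms. Roughly, again with the 1.9/2.0-era API (the field
name, index path and class name below are only placeholders):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchExample {
    public static void main(String[] args) throws IOException, ParseException {
        // Use the same analyzer at query time that was used to build the index.
        Analyzer analyzer = new ChineseAnalyzer();

        // Quoting the text makes QueryParser build a phrase query over the
        // 1-gram terms C2 and C3, so they must appear next to each other.
        Query query = new QueryParser("content", analyzer).parse("\"C2C3\"");

        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " matching documents");
        searcher.close();
    }
}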


On 5/8/06, Bob Cheung <> wrote:
> I have a question for those who have used Lucene to index and search for
> Chinese Characters, what is the best Analyzer for the job?
> I know all these three can do the job:
> 1. StandardAnalyzer
> 2. CJKAnalyzer
> 3. ChineseAnalyzer
> What are the differences between these 3 analyzers?
> TIA.
> Regards,
> Bob