lucene-java-user mailing list archives

From "Ray Tsang" <saturn...@gmail.com>
Subject Re: The best Chinese Analyzer?
Date Mon, 08 May 2006 09:28:43 GMT
Hi Bob,

In short, I use a slightly modified ChineseAnalyzer to index Chinese text.
The three analyzers differ mainly in the way they tokenize the text.

StandardAnalyzer is intended for use w/ Latin-based languages, where each
word is composed of multiple characters and words are separated by
markers such as a space ' ', a comma, a period, a new line, etc. So
"C1C2C3" (space) "C4C5C6" will be tokenized into 2 terms:
"C1C2C3" and "C4C5C6"

CJKAnalyzer tokenizes Chinese text into 2-grams (from
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/CJKTokenizer.java?rev=165565&view=markup)
"C1C2C3C4" -> "C1C2" "C2C3" "C3C4"

ChineseAnalyzer tokenizes Chinese text into 1-grams, one term per character
(http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java?rev=353930&view=markup)
"C1C2C3C4" -> "C1" "C2" "C3" "C4"
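
To make the three strategies concrete, here is a minimal sketch in plain Java. It does not use the Lucene classes; the class and method names (TokenizerSketch, whitespaceTerms, bigrams, unigrams) are made up for illustration, and each letter stands in for one Chinese character. It only mimics the behaviour described above and assumes every character fits in a single Java char.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Lucene.
class TokenizerSketch {

    // Separator-delimited terms, as StandardAnalyzer is described above
    // (only whitespace is handled here, for brevity).
    static List<String> whitespaceTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Overlapping 2-grams, as CJKAnalyzer is described above.
    // A run of a single character is emitted as-is.
    static List<String> bigrams(String run) {
        List<String> terms = new ArrayList<>();
        if (run.length() == 1) {
            terms.add(run);
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            terms.add(run.substring(i, i + 2));
        }
        return terms;
    }

    // One term per character, as ChineseAnalyzer is described above.
    static List<String> unigrams(String run) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < run.length(); i++) {
            terms.add(String.valueOf(run.charAt(i)));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTerms("ABC DEF")); // [ABC, DEF]
        System.out.println(bigrams("ABCD"));            // [AB, BC, CD]
        System.out.println(unigrams("ABCD"));           // [A, B, C, D]
    }
}
```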

The most obvious effect of these 3 tokenization strategies is on the
search results.
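Suppose you search for "C2C3" with the above examples. You cannot find it w/
StandardAnalyzer, because the only indexed term is the whole "C1C2C3".
You can find it w/ ChineseAnalyzer (as the phrase "C2" "C3"), and also w/
CJKAnalyzer, since "C2C3" is one of the indexed 2-grams. A search for the
single character "C2", however, only matches w/ ChineseAnalyzer.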

ray,


On 5/8/06, Bob Cheung <bob.cheung@sirsidynix.com> wrote:
> I have a question for those who have used Lucene to index and search for
> Chinese Characters, what is the best Analyzer for the job?
>
> I know all these three can do the job:
>
> 1. StandardAnalyzer
> 2. CJKAnalyzer
> 3. ChineseAnalyzer
>
> What are the differences between these 3 analyzers?
>
> TIA.
>
> Regards,
> Bob