lucene-java-user mailing list archives

From Lucenius <>
Subject CJK evaluation. Standardanalyzer and Querytime.
Date Mon, 18 Feb 2013 20:29:52 GMT
Hello community,

I am evaluating indexing strategies for CJK text. I am comparing
"unigram", "bigram", "unigram + bigram", and "word based" indexing.
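
As a rough illustration of the first three strategies, here is a minimal
plain-Java sketch of the token streams each one produces. It is not Lucene
code and does not reproduce any analyzer's exact behaviour; the class name
NGramSketch and its method names are made up for this example, and the
"word based" strategy is left out because it needs a dictionary or a
morphological analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
public class NGramSketch {

    // "unigram": one token per Unicode code point.
    public static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    // "bigram": one token per pair of adjacent code points.
    // Assumption: a one-character input yields no tokens in this sketch;
    // a real analyzer may choose to emit the lone character instead.
    public static List<String> bigrams(String text) {
        List<String> uni = unigrams(text);
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < uni.size(); i++) {
            out.add(uni.get(i) + uni.get(i + 1));
        }
        return out;
    }

    // "unigram + bigram": each character, followed by the bigram that starts at it.
    public static List<String> unigramsAndBigrams(String text) {
        List<String> uni = unigrams(text);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < uni.size(); i++) {
            out.add(uni.get(i));
            if (i + 1 < uni.size()) {
                out.add(uni.get(i) + uni.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three Han characters, written as escapes to keep the source ASCII-only.
        String text = "\u6771\u4EAC\u90FD";
        System.out.println(unigrams(text));           // 3 tokens
        System.out.println(bigrams(text));            // 2 tokens
        System.out.println(unigramsAndBigrams(text)); // 5 tokens
    }
}
```

The sample string is 東京都. It gives 3 unigrams (東, 京, 都), 2 bigrams
(東京, 京都), and 5 tokens for the combined strategy, so in this sketch the
combined stream carries roughly twice as many tokens per run of n
characters (2n - 1 against n - 1 for bigrams alone).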

I used the StandardAnalyzer for "unigram". I think it works for Chinese, but
it does something different for Japanese and Korean. In Japanese some
characters get combined, and for Korean it works like a WhitespaceAnalyzer,
right? Which Analyzer would you recommend for "unigrams" in Japanese and
Korean? Is there any flag in the CJKAnalyzer to output "unigrams" only?

I used the CJKAnalyzer for "bigrams" and "unigrams + bigrams". I think it
works correctly, but I have some performance issues. The query time for
"unigram + bigram" is about 8-20 times higher than for "bigram" only. Any
ideas?

Thank you.
