lucene-java-user mailing list archives

From Lucenius <JamTheK...@hotmail.de>
Subject CJK evaluation. Standardanalyzer and Querytime.
Date Mon, 18 Feb 2013 20:29:52 GMT
Hello community,

I am evaluating indexing strategies for CJK text. I am comparing
"unigram", "bigram", "unigram + bigram", and "word-based" indexing.
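
To make the first three strategies concrete, here is a minimal sketch in plain Java of the tokens each one would produce for a short string. It is only an illustration: the class and method names (NGramSketch, unigrams, bigrams, unigramsAndBigrams) are made up for this example, it does not use Lucene's analyzers, and it ignores details such as token positions and script detection.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical helper, not Lucene's tokenization.
class NGramSketch {

    // One token per Unicode code point, so supplementary characters stay whole.
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    // Overlapping pairs of adjacent characters.
    // Assumption: a lone character is kept as a single-character token.
    static List<String> bigrams(String text) {
        List<String> chars = unigrams(text);
        List<String> out = new ArrayList<>();
        if (chars.size() == 1) {
            out.add(chars.get(0));
        }
        for (int i = 0; i + 1 < chars.size(); i++) {
            out.add(chars.get(i) + chars.get(i + 1));
        }
        return out;
    }

    // All unigrams followed by all bigrams (simply concatenated here).
    static List<String> unigramsAndBigrams(String text) {
        List<String> out = new ArrayList<>(unigrams(text));
        if (out.size() > 1) {
            out.addAll(bigrams(text));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "日本語";
        System.out.println(unigrams(text));           // [日, 本, 語]
        System.out.println(bigrams(text));            // [日本, 本語]
        System.out.println(unigramsAndBigrams(text)); // [日, 本, 語, 日本, 本語]
    }
}
```

For a run of n characters this gives n unigrams, n - 1 bigrams, and 2n - 1 tokens for the combined strategy.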

1.
I used the StandardAnalyzer for "unigram". I think it works for Chinese, but
it behaves differently for Japanese and Korean. In Japanese some
characters get combined, and for Korean it works like a WhitespaceAnalyzer,
right? Which analyzer would you recommend for "unigrams" in Japanese and
Korean? Is there a flag in the CJKAnalyzer to output "unigrams" only?

2.
I used the CJKAnalyzer for "bigrams" and "unigrams + bigrams". I think it
works correctly, but I have some performance issues. The query time for
"unigram + bigram" is about 8 to 20 times higher than for "bigram" only. Any ideas?

Thank you.




--
View this message in context: http://lucene.472066.n3.nabble.com/CJK-evaluation-Standardanalyzer-and-Querytime-tp4041190.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
