lucene-java-user mailing list archives

From Ivan Vasilev <ivasi...@sirma.bg>
Subject Is there bug in CJKAnalyzer?
Date Mon, 22 Oct 2007 15:39:05 GMT
Hi Guys,

I have run some tests with CJKAnalyzer, and the results seem very 
strange to me. First I have to say that I do not understand any of the 
CJK languages.
What I do is the following: I write some text in English and translate 
it using an online tool, which gives me the translated script per word 
or per group of words. I put the translated text in separate files and 
index them, using the proper encoding for the readers.
What is strange is that when I search for just one character (no matter 
whether it is a separate word in the text or part of a word), Lucene 
almost never finds a result. It finds results in fewer than 5% of cases, 
for example for words like that=那, commas and so on.
I also copied and pasted text from the Chinese Academy of Sciences web 
site, to rule out the possibility that the translation tool does not 
work correctly. The result is the same.
But when I search for two or more consecutive characters, everything is 
OK: if they are present in the text, they are found.

So my question is: is it normal behavior for CJKAnalyzer not to find 
results when only one character is searched for, or is there a bug in 
that analyzer?

I would also like to say that I reindexed with a very simple class (not 
with our search engine) to rule out any possible mistakes. The results 
are the same.

Here is an example of the text that I use:

English:

The quick brown fox jumped over the lazy dog.

Chinese:

灵布朗狐逾懒狗。

English word by word:

|NA The |1 quick |2 brown |3 fox |4 jumped over |NA the |5 lazy |6 dog |7.

Corresponding Chinese words:

|1 灵 |2 布朗 |3 狐 |4 逾 |5 懒 |6 狗 |7。

NOTE: My files contain only the Chinese text.
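
One possible explanation, offered as an assumption rather than a confirmed diagnosis: if CJKAnalyzer indexes each run of CJK characters as overlapping two-character tokens (bigrams) and emits a single-character token only when a character stands alone between non-CJK text or punctuation, then a one-character query would rarely match anything in the index. The sketch below is plain Java and is not Lucene code; the names BigramSketch, bigrams and isCjkLetter are hypothetical, and the test for "CJK" is deliberately crude. It only shows what such a scheme would produce for the sample sentence above.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only; this is NOT the source of Lucene's CJKAnalyzer.
class BigramSketch {

    // Assumption: any letter outside the Basic Latin block counts as "CJK".
    // Punctuation such as the ideographic full stop is not a letter, so it
    // ends the current run.
    static boolean isCjkLetter(char c) {
        return Character.isLetter(c)
                && Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN;
    }

    // Splits the text into runs of CJK letters and emits overlapping
    // two-character tokens per run; a run of length one yields one
    // single-character token.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!isCjkLetter(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && isCjkLetter(text.charAt(i))) {
                i++;
            }
            if (i - start == 1) {
                tokens.add(text.substring(start, i));
            } else {
                for (int j = start; j + 1 < i; j++) {
                    tokens.add(text.substring(j, j + 2));
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> indexed = bigrams("灵布朗狐逾懒狗。");
        System.out.println(indexed);
        // A query is analyzed the same way, so it must match a whole token.
        System.out.println(indexed.containsAll(bigrams("狐")));   // one-character query
        System.out.println(indexed.containsAll(bigrams("布朗"))); // two-character query
    }
}
```

Under these assumptions the sentence is indexed as the six tokens 灵布, 布朗, 朗狐, 狐逾, 逾懒 and 懒狗. The query 狐 becomes the single token 狐, which is not among them, while 布朗 is. A character that happens to be isolated by punctuation or Latin text would be indexed on its own, which might account for the small share of one-character searches that do succeed.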

Best Regards,
Ivan

