lucene-java-user mailing list archives

From "Samir Abdou" <abdou.sa...@gmail.com>
Subject Re: Is there bug in CJKAnalyzer?
Date Mon, 22 Oct 2007 16:29:13 GMT
Hi,

For a Chinese token like ABCD (where A, B, C and D are Chinese characters),
CJKAnalyzer will generate the following overlapping bigrams: AB  BC  CD.
The index therefore contains no single-character terms for that token, so a
query consisting of one Chinese character will not retrieve any documents.
To overcome this, you have to index Chinese characters as single tokens
(this will increase recall, but decrease precision).
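
To make the effect concrete, here is a minimal sketch in plain Java. It is
not the CJKAnalyzer code and does not use the Lucene API: the class
BigramSketch and its methods are hypothetical names, the letters A-D stand in
for Chinese characters as above, and matching is modeled as "every query
token must be an indexed term". Emitting a lone character as a single token
is also an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {

    // Overlapping bigrams of a run of characters: "ABCD" -> [AB, BC, CD].
    // Assumption of this sketch: a run of length one is emitted unchanged.
    static List<String> bigrams(String run) {
        List<String> tokens = new ArrayList<>();
        if (run.length() == 1) {
            tokens.add(run);
            return tokens;
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            tokens.add(run.substring(i, i + 2));
        }
        return tokens;
    }

    // One token per character: "ABCD" -> [A, B, C, D].
    static List<String> unigrams(String run) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < run.length(); i++) {
            tokens.add(run.substring(i, i + 1));
        }
        return tokens;
    }

    // Simplified matching: all query tokens must occur among the indexed terms.
    static boolean matches(List<String> indexedTerms, List<String> queryTokens) {
        return indexedTerms.containsAll(queryTokens);
    }

    public static void main(String[] args) {
        List<String> bigramIndex = bigrams("ABCD");
        List<String> unigramIndex = unigrams("ABCD");

        System.out.println(bigramIndex);                            // [AB, BC, CD]
        System.out.println(matches(bigramIndex, bigrams("B")));     // false
        System.out.println(matches(bigramIndex, bigrams("BC")));    // true
        System.out.println(matches(unigramIndex, unigrams("B")));   // true
        System.out.println(matches(unigramIndex, unigrams("CB")));  // true
    }
}
```

The last line shows the precision cost: with single-character tokens, the
query CB matches the document even though C and B never occur in that order,
unless the query is also required to match as a phrase.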

Hope this will help,
Samir



2007/10/22, Ivan Vasilev <ivasilev@sirma.bg>:
>
> Hi Guys,
>
> I have run tests with the CJKAnalyzer and the results show something
> that seems very strange to me. First I have to say that I do not
> understand any of the CJK languages.
> What I do is the following: I write some text in English and translate it
> using an online tool, which gives me the translated script per word or
> per group of words. I put the translated text in separate files and
> index them using the proper encoding for the readers.
> What is strange is that when searching for just one character (no matter
> whether it is a separate word in the text or part of a word) Lucene almost
> never finds a result (it finds results in maybe fewer than 5% of cases, for
> words like "that", commas and so on).
> I also copied and pasted text from the Chinese Academy of Sciences web site
> to rule out the case that the translation tool does not work correctly. The
> result is the same.
> But when searching for two or more consecutive characters everything is
> OK: if they are present in the text, they are found.
>
> So my question is: Is it normal behavior for CJKAnalyzer not to find
> results when only one character is searched for, or is there some bug in
> that Analyzer?
>
> I would also like to say that I reindexed with a very simple class (not
> with our search engine) to rule out any possible mistakes. The results
> are the same.
>
> I will give the example of the text that I use:
>
> English:
>
> The quick brown fox jumped over the lazy dog.
>
> Chinese:
>
> 鲼ʺ
>
> English word by word:
>
> |NA The |1 quick |2 brown |3 fox |4 jumped over |NA the |5 lazy |6 dog |7.
>
> Responding Chinese words:
>
> |1  |2  |3  |4  |5  | 6  |7
>
> NOTE: My files contain only the Chinese text.
>
> Best Regards,
> Ivan