Message-ID: <471CC419.4020407@sirma.bg>
Date: Mon, 22 Oct 2007 18:39:05 +0300
From: Ivan Vasilev
Reply-To: java-user@lucene.apache.org
To: LUCENE MAIL LIST
Subject: Is there bug in CJKAnalyzer?

Hi Guys,

I have run some tests with the CJKAnalyzer, and the results seem very strange to me.

First I have to say that I do not understand any of the CJK languages. What I do is the following: I write some text in English and translate it using an on-line tool, which gives me the translation per word or per group of words. I put the translated text in separate files and index them, using the proper encoding for the readers.

What is strange is that when I search for a single character (no matter whether it is a separate word in the text or part of a word), Lucene almost never finds a result. Perhaps fewer than 5% of such searches return results, for example for a word like "that" = 那, for commas, and so on. I also copied and pasted text from the Chinese Academy of Science web site, to rule out the case where the translation tool does not work correctly. The result is the same.

But when I search for two or more consecutive characters, everything is OK: if they are present in the text, they are found.

So my question is: is this normal behavior for CJKAnalyzer, not finding results when only one character is searched, or is there a bug in that Analyzer?

I would also like to say that I reindexed with a very simple class (not with our search engine) to rule out any possible mistakes on our side. The results are the same.

Here is an example of the text that I use:

English: The quick brown fox jumped over the lazy dog.
Chinese: 灵布朗狐逾懒狗。

English word by word: |NA The |1 quick |2 brown |3 fox |4 jumped over |NA the |5 lazy |6 dog |7 .
Corresponding Chinese words: |1 灵 |2 布朗 |3 狐 |4 逾 |5 懒 |6 狗 |7 。

NOTE: My files contain only the Chinese text.

Best Regards,
Ivan
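
One way to reason about the behavior described above is sketched below. It rests on an assumption that is not confirmed anywhere in this message: that CJKAnalyzer indexes each run of adjacent CJK characters as overlapping two-character tokens (bigrams), and emits a single-character token only when a character has no CJK neighbor. The class BigramSketch and the method bigrams are hypothetical names made up for illustration. They do not use or reproduce any Lucene API; they only mimic the assumed tokenization on the example sentence.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {

    // Hypothetical helper, NOT a Lucene API. It mimics the ASSUMED behavior:
    // a run of adjacent ideographs becomes overlapping two-character tokens,
    // and an isolated ideograph becomes a one-character token.
    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        int i = 0;
        while (i < cps.length) {
            if (!Character.isIdeographic(cps[i])) {
                i++; // punctuation, spaces, Latin text: ends the current run
                continue;
            }
            int start = i;
            while (i < cps.length && Character.isIdeographic(cps[i])) {
                i++;
            }
            if (i - start == 1) {
                // Isolated ideograph: the only case that yields a single-character token.
                tokens.add(new String(cps, start, 1));
            } else {
                // Run of two or more: emit every adjacent pair.
                for (int j = start; j + 1 < i; j++) {
                    tokens.add(new String(cps, j, 2));
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> indexed = bigrams("灵布朗狐逾懒狗。");
        System.out.println(indexed);
        // prints [灵布, 布朗, 朗狐, 狐逾, 逾懒, 懒狗]

        // A query is tokenized the same way, then looked up in the indexed tokens.
        System.out.println(indexed.containsAll(bigrams("狐")));   // prints false
        System.out.println(indexed.containsAll(bigrams("布朗")));  // prints true
    }
}
```

Under that assumption, the sentence never produces the token 狐 on its own, so a one-character query has nothing to match, while the two-character query 布朗 matches an indexed bigram. It would also fit the rare single-character hits: a character that stands alone between punctuation marks would be indexed by itself.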