lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ivan Vasilev <ivasi...@sirma.bg>
Subject Re: Is there bug in CJKAnalyzer?
Date Tue, 23 Oct 2007 12:08:43 GMT
Thanks Samir :)

You info was really helpful for us. I saw the index by Luke and there
the Chinese signs were split in pairs as you said 每 AB BC CD etc. Also
when querying for ABC it is split in the query to AB BC.
But how to understand the meaning of this:
※To overcome this, you have to index chinese characters as single tokens
(this will increase recall, but decrease precision).§

I understand it so: To increase the results I have to use instead of the Chinese another analyzer
that makes tokenization of the text character by character. 

Do you know if the range searches  work correctly with CJK texts when during the indexing
was used CJKAnalyzer? My opinion is that as the terms are sorted by using String.copmareTo
method but not Collator.compare then the range searches will as close to correct as the second
method is close to the first one.
Interesting is when passing more than 2 Chinese characters 每 then the Lucene query is not
split to bigram sub queries (as in the normal queries) and may be some incorrect results can
come because of this.

Example: there is indexed the text: AAB. The index will contain: AA AB.
User range search is: content:[AG TO PQ]
Here may be B will mean some word, and user will expect it in the results (as B is after AG
and before PQ), but as it is not single token but in pair AB which is before AG, it will not
be returned.
Here may be I guess your answer 每 to prevent this use per character analyzer :) 

Thanks once again Samir for your explanation of how CJKAnalyzer works it was really calming
for me to know that my CJKAnalyzer work as it is expected. 

Best Regards,
Ivan



Samir Abdou wrote:
> Hi,
>
> For a chinese token like ABCD (where A,B,C and D are chinese signs),
> CJKAnalyzer will generate the following overlapping bigrams: AB  BC  CD.
> Thus issuing a query containing one chinese sign will not retrieve any
> documents.  To overcome this, you have to index chinese characters as single
> tokens (this will increase recall, but decrease precision).
>
> Hope this will help,
> Samir
>
>
>
> 2007/10/22, Ivan Vasilev <ivasilev@sirma.bg>:
>   
>> Hi Guys,
>>
>> I have made tests with the CJKAnalyzer and the results show something
>> that seems very strange to me. First I have to say that I do not
>> understand non of the CJK languages.
>> What I do is the following I write some text in English and translate it
>> using an on-line tool, which give me the translated script per word or
>> per group of words. The translated text I put in separate files and
>> index them using proper encoding for readers.
>> What is strange is that when searching just one hieroglyph (no matter if
>> it is separate word in the text or part of a word) Lucene almost never
>> finds result (may be only in less than 5% find results for word like 每
>> that=饒, commas and so).
>> I also copy/pasted text from Chinese Academy of Science web site to
>> ignore results in case the translation toll does not work correctly. The
>> result is the same.
>> But when searching for two or more consequent hieroglyphs everything is
>> OK if they persist in the text they are found.
>>
>> So my question is: Is this normal behavior for CJKAnalyzer 每 not to find
>> results when only one hieroglyph is searched or there is some bug with
>> that Analyzer?
>>
>> I also would like to say that I reindexed with a very simple class (not
>> with our searching engine) to ignore any possible mistakes. The results
>> are the same.
>>
>> I will give the example of the text that I use:
>>
>> English:
>>
>> The quick brown fox jumped over the lazy dog.
>>
>> Chinese:
>>
>> 鍾票檄緒貣擱僩﹝
>>
>> English word by word:
>>
>> |NA The |1 quick |2 brown |3 fox |4 jumped over |NA the |5 lazy |6 dog |7.
>>
>> Responding Chinese words:
>>
>> |1 鍾 |2 票檄 |3 緒 |4 貣 |5 擱 | 6 僩 |7﹝
>>
>> NOTE: My files contain only the Chinese text.
>>
>> Best Regards,
>> Ivan
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>     


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message