lucene-java-user mailing list archives

From Ivan Vasilev <ivasi...@sirma.bg>
Subject Re: Is there bug in CJKAnalyzer?
Date Wed, 24 Oct 2007 10:33:11 GMT
Hi Steven,

Thank you very much for your answer.
I tested with the StandardAnalyzer and it really does tokenize the text
ideograph by ideograph. Maybe, as Samir says in his mail, this is not
convenient for people who use CJK languages, because too many documents
will match. But the thing is that in this case (when using
StandardAnalyzer) the range searches work correctly. I tested it. The
logic is the same as in English range searches. Suppose that in English
you have the word "brown" and some tokenizer tokenizes it letter by
letter like this: 'b' 'r' 'o' 'w' 'n'. You can still search with bounds
of more than one character. For example, consider the following search -
content:[aaa TO ccc] - the token 'b' will be found, because "b" sorts
between "aaa" and "ccc".
Yes, for letter-based languages it does not make sense to tokenize letter
by letter, of course. But in CJK, as far as I know, in a great number of
cases single ideographs are separate words, or even groups of words.
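
To make the "brown" example concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class RangeDemo and the helpers inRange and matches are hypothetical names, and it assumes that an inclusive range search compares terms the same way String.compareTo does.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeDemo {
    // Inclusive range check using plain String ordering.
    // Assumption: the index orders terms like String.compareTo.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Collect the terms that fall inside [lower TO upper].
    static List<String> matches(String[] terms, String lower, String upper) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (inRange(t, lower, upper)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "brown" tokenized letter by letter
        String[] terms = {"b", "r", "o", "w", "n"};
        // Same bounds as content:[aaa TO ccc]
        System.out.println(matches(terms, "aaa", "ccc")); // prints [b]
    }
}
```

Only 'b' is returned: it is greater than "aaa" and smaller than "ccc", while the other letters sort after "ccc".
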
I tested range searches on Chinese text indexed with
StandardAnalyzer, and everything in this context is OK.
The searches:
content:[\u0E80 TO 的\u0E80]
content:[\u0E80\u0E80 TO 的\u0E80]
content:[\u0E80\u0E80\u0E80 TO 的\u0E80\u0E80]

not only work but return the same result set as:
content:[\u0E80 TO 的]

Here \u0E80 is the code point I use as the lower bound for the CJK
range, and 的 is an ideograph that occurs in some of the text files.
This of course also works with the CJKAnalyzer. But with
StandardAnalyzer, I think, the case that I described in my previous
mail will be avoided.
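
The reason the longer bounds return the same result set as content:[\u0E80 TO 的] can be sketched in plain Java. Again this is not Lucene code: CjkRangeDemo, inRange and matches are hypothetical names, the five sample ideographs are made up for illustration, and the sketch assumes single-character terms compared like String.compareTo.

```java
import java.util.ArrayList;
import java.util.List;

public class CjkRangeDemo {
    // Inclusive range check using plain String ordering.
    // Assumption: the index orders terms like String.compareTo.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Collect the terms that fall inside [lower TO upper].
    static List<String> matches(String[] terms, String lower, String upper) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (inRange(t, lower, upper)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String low = "\u0E80"; // lower bound used in the searches
        String de = "\u7684";  // the ideograph U+7684
        // Made-up single-ideograph terms: U+4E00, U+4EBA, U+7684, U+8BF4, U+9F99
        String[] terms = {"\u4E00", "\u4EBA", "\u7684", "\u8BF4", "\u9F99"};

        List<String> base = matches(terms, low, de);
        List<String> padded = matches(terms, low + low, de + low);

        // A one-character term equal to the upper ideograph is a prefix of the
        // padded upper bound, so it still sorts below it and stays included.
        System.out.println(base.equals(padded)); // prints true
    }
}
```

The two result sets differ only if a term equal to the bare lower bound itself exists in the index, which does not happen with Chinese text here.
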

So I know range searches are a bit slower, but I am just fulfilling the
requirement of our customers. They will decide whether range searches
are convenient or not, and which Analyzer will help them more.

Thanks once again :)

Best Regards,
Ivan

Steven Rowe wrote:
> Hi Ivan,
>
> Ivan Vasilev wrote:
>   
>> But how to understand the meaning of this: “To overcome this, you
>> have to index chinese characters as single tokens (this will increase
>> recall, but decrease precision).”
>>
>> I understand it so: To increase the results I have to use instead of 
>> the Chinese another analyzer that makes tokenization of the text 
>> character by character.
>>     
>
> StandardTokenizer[1] produces single-character tokens for Chinese
> ideographs and Japanese kana.
>
> However, AFAIK, you will no longer be able to perform range searches
> like [AG TO PQ], because the terms "AG" and "PQ" will not be present in
> the index.  [A TO P] should work, but I don't know how useful the
> results would be, since this would match all words that contain the
> ideographs [A TO P], not just those that start with them.  (Note that
> this is also the case with the bigram tokens produced by CJKAnalyzer.)
>
> By the way, what is the use case for matching a range of words?  Doesn't
> exposing this kind of functionality cause performance concerns?
>
> Steve
>
> [1] Lucene's StandardTokenizer API doc:
> <http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
>
>   

