lucene-java-user mailing list archives

From Steven Rowe <sar...@syr.edu>
Subject Re: Is there bug in CJKAnalyzer?
Date Tue, 23 Oct 2007 15:38:47 GMT
Hi Ivan,

Ivan Vasilev wrote:
> But how to understand the meaning of this: “To overcome this, you
> have to index chinese characters as single tokens (this will increase
> recall, but decrease precision).”
> 
> I understand it so: To increase the results I have to use instead of 
> the Chinese another analyzer that makes tokenization of the text 
> character by character.

StandardTokenizer[1] produces single-character tokens for Chinese
ideographs and Japanese kana.

However, AFAIK, you will no longer be able to perform range searches
like [AG TO PQ], because the terms "AG" and "PQ" will not be present in
the index.  [A TO P] should work, but I don't know how useful the
results would be: it would match every word that contains an ideograph
in the range [A TO P], not just the words that start with one.  (Note
that the same is true of the bigram tokens produced by CJKAnalyzer.)
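
To make the "contains" versus "starts with" point concrete, here is a
minimal sketch in plain Java.  It does not use the Lucene API: the class
CjkRangeSketch and its methods are hypothetical names, Latin letters
stand in for ideographs as in the examples above, and the range check is
a plain lexicographic comparison over a document's terms, which is only
an assumption about how a term range behaves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Lucene.
public class CjkRangeSketch {

    // One term per character, in the spirit of single-character tokenization.
    // Assumes every character fits in one Java char (no surrogate pairs).
    static List<String> unigrams(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            terms.add(text.substring(i, i + 1));
        }
        return terms;
    }

    // Overlapping two-character terms, in the spirit of bigram tokenization.
    // Simplification: a lone character is emitted as a single-character term.
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        if (text.length() == 1) {
            terms.add(text);
            return terms;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }

    // True if any term falls in the inclusive lexicographic range [lower, upper].
    static boolean matchesRange(List<String> terms, String lower, String upper) {
        for (String term : terms) {
            if (term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "XBY" starts with X, which is outside [A TO P], but it contains B.
        System.out.println(matchesRange(unigrams("XBY"), "A", "P"));
        // The bigram "BY" also falls inside [A TO P], so the word still matches.
        System.out.println(matchesRange(bigrams("XBY"), "A", "P"));
        // "XYZ" has no term inside the range under either scheme.
        System.out.println(matchesRange(unigrams("XYZ"), "A", "P"));
        System.out.println(matchesRange(bigrams("XYZ"), "A", "P"));
    }
}
```

In this sketch the word "XBY" matches [A TO P] under both schemes even
though it does not start with a character in the range, because the
term "B" (or the bigram "BY") produced from the middle of the word sorts
inside the range.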

By the way, what is the use case for matching a range of words?  Doesn't
exposing this kind of functionality cause performance concerns?

Steve

[1] Lucene's StandardTokenizer API doc:
<http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
