lucene-java-user mailing list archives

From "Burton-West, Tom" <tburt...@umich.edu>
Subject Bigrams for CJK with ICUTokenizer ?
Date Fri, 04 Feb 2011 17:46:54 GMT
Hello all,

We are using the ICUTokenizer because we have documents in about 400 different languages.
We are also setting autoGeneratePhraseQueries to false so that text in CJK and other languages
that don't use spaces to separate words isn't split into tokens by the ICUTokenizer and then
automatically searched as a phrase.

The ICUTokenizer emits unigrams for Chinese (Han script). We would prefer to use overlapping
bigrams, as the CJKAnalyzer does. Is it possible to configure the ICUTokenizer to emit
overlapping bigrams?
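
To make the request concrete, here is a minimal sketch in plain Java of the output we have in mind. It uses no Lucene classes; the class name BigramSketch and the method hanBigrams are hypothetical names made up for illustration, and the assumption that a run of Han characters is bigrammed while a lone Han character stays a unigram is ours, not a description of what any existing analyzer does.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical helper, not part of Lucene or ICU.
class BigramSketch {

    // True if the code point belongs to the Han script (JDK 7+ API).
    static boolean isHan(int cp) {
        return Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
    }

    // For each run of consecutive Han code points, emit overlapping bigrams.
    // Assumption: a run of length 1 is emitted as a single unigram.
    // Non-Han text is skipped here to keep the sketch short.
    static List<String> hanBigrams(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < cps.length) {
            if (!isHan(cps[i])) {
                i++;
                continue;
            }
            int start = i;
            while (i < cps.length && isHan(cps[i])) {
                i++;
            }
            if (i - start == 1) {
                out.add(new String(cps, start, 1));
            }
            for (int j = start; j + 1 < i; j++) {
                out.add(new String(cps, j, 2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Unigram tokenization would give 中, 华, 人, 民.
        System.out.println(hanBigrams("中华人民")); // prints [中华, 华人, 人民]
    }
}
```

Working on code points rather than chars keeps supplementary-plane Han characters intact, which matters for a collection as varied as ours.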

Alternatively, is there some way to put some filter in the filter chain after the ICUTokenizer
that would produce overlapping bigrams for CJK?

Tom Burton-West

