lucene-solr-user mailing list archives

From "Burton-West, Tom" <tburt...@umich.edu>
Subject CJKBigram filter questions: single character queries, bigrams created across script/character types
Date Fri, 27 Apr 2012 17:43:56 GMT
I have a few questions about the CJKBigram filter.

About 10% of our queries that contain Han characters are single-character queries. It looks
like the CJKBigram filter outputs single characters only when there are no adjacent bigrammable
characters in the input. This means we would have to create a separate field to index Han
unigrams in order to address single-character queries. Is this correct?
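
To make the concern concrete, here is a minimal sketch in plain Python. It is not Lucene code; `bigram_tokens` is a hypothetical helper that only mimics the behavior described above (overlapping bigrams for adjacent characters, a unigram only for a lone character). Under that assumption, a single-character query produces a term that a bigram-only field never contains:

```python
def bigram_tokens(run):
    """Overlapping bigrams for a run of adjacent characters.

    Hypothetical stand-in for the described behavior: a lone character
    has no neighbor to pair with, so it is emitted as a unigram.
    """
    if len(run) < 2:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


# Terms that would end up in a bigram-only field for this text.
indexed_terms = set(bigram_tokens("革命歌"))

# A single-character query is analyzed to a unigram.
query_terms = bigram_tokens("命")

# No query term appears among the indexed bigrams.
matches = [t for t in query_terms if t in indexed_terms]
print(indexed_terms, query_terms, matches)
```

The query character occurs in the indexed text, but only inside the bigrams, so the unigram term has nothing to match.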

For Japanese, the default settings form bigrams across character types. So for a string containing
both Hiragana and Han characters, bigrams that mix Hiragana and Han characters are
formed:
いろは革命歌   =>    “いろ” “ろは” “は革” “革命” “命歌”

Is there a way to specify that you don’t want bigrams across character types?
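
The difference between the two behaviors can be sketched in plain Python. This is an illustration only, not the filter's implementation: `script_of` and `cjk_bigrams` are hypothetical names, and the script test is a simplifying assumption that uses just the Hiragana block (U+3040 to U+309F) and the main CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
from itertools import groupby


def script_of(ch):
    """Rough script classification by code point (simplifying assumption)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    return "other"


def bigram_tokens(run):
    """Overlapping bigrams for a run; a lone character stays a unigram."""
    if len(run) < 2:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def cjk_bigrams(text, cross_script=True):
    """Bigram the whole string, or each same-script run separately."""
    if cross_script:
        return bigram_tokens(text)
    tokens = []
    # groupby yields maximal runs of consecutive same-script characters.
    for _, group in groupby(text, key=script_of):
        tokens.extend(bigram_tokens("".join(group)))
    return tokens


print(cjk_bigrams("いろは革命歌"))                      # includes the mixed bigram "は革"
print(cjk_bigrams("いろは革命歌", cross_script=False))  # no bigram spans the script boundary
```

With `cross_script=False` the mixed bigram “は革” is not produced, which is the output being asked for here.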

Tom

Tom Burton-West
Digital Library Production Service
University of Michigan Library

http://www.hathitrust.org/blogs/large-scale-search

