lucene-dev mailing list archives

From "Lance Norskog (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4286) Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams
Date Mon, 06 Aug 2012 22:07:02 GMT


Lance Norskog commented on LUCENE-4286:

If you index unigrams and bigrams in separate fields, you can boost bigram matches over
unigram matches. We did that for one customer and it really helped. Our text was technical
and tended towards "long" words: lots of bigrams and trigrams. Have you tried the Smart
Chinese toolkit? It produces far fewer bigrams, and our project worked well with it. I would
try that, with its misfires further broken into bigrams, in preference to general bigramming.
Cf. [SOLR-3653] about the "misfires".
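
To illustrate the separate-fields idea, here is a minimal Python sketch. It is a toy
scoring model, not how Lucene or Solr actually scores: the function names and the boost
values (3.0 for bigrams, 1.0 for unigrams) are assumptions picked for illustration only.

```python
def unigrams(text):
    """Character unigrams, skipping whitespace (stand-in for a unigram field)."""
    return [ch for ch in text if not ch.isspace()]


def bigrams(text):
    """Overlapping character bigrams (stand-in for a bigram field)."""
    chars = unigrams(text)
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def score(query, doc, bigram_boost=3.0, unigram_boost=1.0):
    """Toy score: boosted count of distinct query terms found in each 'field'.

    The boost values are hypothetical; a real engine would combine
    per-field relevance scores rather than raw hit counts.
    """
    doc_uni = set(unigrams(doc))
    doc_bi = set(bigrams(doc))
    uni_hits = sum(1 for t in set(unigrams(query)) if t in doc_uni)
    bi_hits = sum(1 for t in set(bigrams(query)) if t in doc_bi)
    return bigram_boost * bi_hits + unigram_boost * uni_hits
```

With these assumed boosts, a document that contains the query characters adjacent to each
other outranks one that merely contains them scattered, which is the effect we were after.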

In general we found Chinese-language search a really hard problem, and doubly so when nobody
on the team speaks Chinese. 

> Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams
> -------------------------------------------------------------------------
>                 Key: LUCENE-4286
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 4.0-ALPHA, 3.6.1
>            Reporter: Tom Burton-West
>            Priority: Minor
>             Fix For: 4.0, 5.0
>         Attachments: LUCENE-4286.patch, LUCENE-4286.patch
> Add an optional flag to the CJKBigramFilter to tell it to also output unigrams. This
> would allow indexing of both bigrams and unigrams, and at query time the analyzer could analyze
> queries as bigrams unless the query contained a single Han unigram.
> As an example, here is a configuration of a Solr fieldType with the index-time analyzer
> using the "indexUnigrams" flag and the query-time analyzer omitting it.
> <fieldType name="CJK" autoGeneratePhraseQueries="false">
> <analyzer type="index">
>    <tokenizer class="solr.ICUTokenizerFactory"/>
>    <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
> </analyzer>
> <analyzer type="query">
>    <tokenizer class="solr.ICUTokenizerFactory"/>
>    <filter class="solr.CJKBigramFilterFactory" han="true"/>
> </analyzer>
> </fieldType>
> Use case: About 10% of our queries that contain Han characters are single-character queries.
> The CJKBigramFilter only outputs single characters when there are no adjacent bigrammable
> characters in the input. This means we have to create a separate field to index Han unigrams
> in order to address single-character queries, and then write application code to search that
> separate field if we detect a single-character Han query. This is rather kludgy. With the
> optional flag, we could configure Solr as above.
> This is somewhat analogous to the flags in LUCENE-1370 for the ShingleFilter used to
> allow single-word queries (although that uses word n-grams rather than character n-grams).
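
The behaviour described in the issue can be sketched in a few lines of Python. This is a
simplified model, not the Lucene implementation: the function names and the
`output_unigrams` parameter are hypothetical, Han detection is assumed to cover only the
basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and non-Han text is simply dropped.

```python
def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as Han.
    return "\u4e00" <= ch <= "\u9fff"


def han_runs(text):
    """Yield maximal runs of consecutive Han characters; other text is ignored."""
    run = []
    for ch in text:
        if is_han(ch):
            run.append(ch)
        elif run:
            yield "".join(run)
            run = []
    if run:
        yield "".join(run)


def cjk_tokens(text, output_unigrams=False):
    """Overlapping bigrams per Han run, optionally with every unigram as well.

    A run of length one is always emitted as a single character, matching the
    existing behaviour described in the issue.
    """
    tokens = []
    for run in han_runs(text):
        if len(run) == 1:
            tokens.append(run)
            continue
        for i, ch in enumerate(run):
            if output_unigrams:
                tokens.append(ch)
            if i + 1 < len(run):
                tokens.append(run[i:i + 2])
    return tokens
```

In this model, indexing "中文搜索" without the flag yields only the three bigrams, so the
single-character query "文" (which analyzes to itself) finds nothing. With the flag set at
index time, "文" is among the indexed tokens and the same query matches, without a second
field or application-side query routing.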
