lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Add flag to CJKBigramFilter to also output unigrams (Single character Han queries)
Date Fri, 03 Aug 2012 22:22:09 GMT
Tom, please open an issue for this.

On Fri, Aug 3, 2012 at 6:19 PM, Tom Burton-West <tburtonw@umich.edu> wrote:
> Hello all,
>
> About 10% of our queries that contain Han characters are single-character
> queries. It looks like the CJKBigramFilter only outputs single characters
> when there are no adjacent bigrammable characters in the input. This means
> we have to create a separate field to index Han unigrams in order to address
> single-character queries, and then write application code to search that
> separate field if we detect a single-character Han query. This is rather
> kludgy. As an alternative approach to dealing with single-character Han
> queries, would it be possible to add an optional flag to the
> CJKBigramFilter to tell it to also output unigrams?
>
> That way on indexing we could set the flag so that both unigrams and bigrams
> would be indexed.  On querying we would not set the flag so that the current
> logic which outputs bigrams unless there is a single Han character (in which
> case that gets output) would take care of queries containing a single Han
> unigram.
>
> This is somewhat analogous to the flags in LUCENE-1370 for the ShingleFilter.
>
> If this makes sense I'll open a JIRA issue.
>
> Tom Burton-West
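
For illustration only, here is a minimal sketch of the behavior proposed above. It is not Lucene code: the function names and the `output_unigrams` parameter are hypothetical, and "Han" is simplified to the CJK Unified Ideographs block, which is narrower than what a real analyzer would handle.

```python
def is_han(ch):
    # Simplifying assumption: only U+4E00..U+9FFF counts as a Han character.
    return "\u4e00" <= ch <= "\u9fff"


def cjk_tokens(text, output_unigrams=False):
    """Tokenize runs of Han characters into bigrams.

    A run of length one always yields its single character (the current
    behavior described in the message). With output_unigrams=True
    (the hypothetical flag), every character is also emitted as a unigram.
    Non-Han characters only separate runs and are otherwise ignored here.
    """
    tokens = []
    run = []

    def flush():
        n = len(run)
        for i in range(n):
            if output_unigrams or n == 1:
                tokens.append(run[i])
            if i + 1 < n:
                tokens.append(run[i] + run[i + 1])
        run.clear()

    for ch in text:
        if is_han(ch):
            run.append(ch)
        else:
            flush()
    flush()
    return tokens


# Index time: flag set, so unigrams and bigrams are both indexed.
indexed = cjk_tokens("中国人", output_unigrams=True)
# Query time: flag not set; a single Han character still comes out as a unigram.
query = cjk_tokens("中")
print(indexed)                              # ['中', '中国', '国', '国人', '人']
print(query)                                # ['中']
print(all(t in indexed for t in query))     # True
```

With the flag set only at index time, the single-character query term finds a matching unigram in the same field, so no separate unigram field or application-side routing is needed.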



-- 
lucidimagination.com
