lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Bigrams for CJK with ICUTokenizer ?
Date Fri, 04 Feb 2011 17:57:58 GMT
On Fri, Feb 4, 2011 at 12:46 PM, Burton-West, Tom <tburtonw@umich.edu> wrote:
> Hello all,
>
> We are using the ICUTokenizer because we have documents in about 400 different languages.
> We are also setting autoGeneratePhraseQueries to false so that text in CJK and other languages
> that don't use spaces to separate words isn't tokenized by the ICUTokenizer and then
> automatically searched as a phrase.
>
> The ICUTokenizer emits unigrams for Chinese (HAN). We would prefer to use overlapping
> bigrams as in the CJKAnalyzer. Is it possible to configure the ICUTokenizer to emit overlapping
> bigrams?
>
> Alternatively, is there some way to put some filter in the filter chain after the ICUTokenizer
> that would produce overlapping bigrams for CJK?
>

Hi Tom, let's open a JIRA issue for this; we can add it.
The gist is that ICUTokenizer sets a ScriptAttribute (an
integer) on each token indicating its writing system.
So it's easy to make an efficient filter that basically only "shingles"
tokens based on this attribute.
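
To make the idea concrete, here is a minimal standalone sketch of script-based bigramming. It is not Lucene API and not the filter proposed above: the Token class, the LATIN and HAN constants, and the bigramHan method are hypothetical stand-ins for a token stream carrying ScriptAttribute values. It assumes that only Han tokens with touching offsets should be paired, and that a lone Han token is kept as a unigram.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch only; none of these names come from Lucene. */
final class ScriptBigramSketch {
    // Hypothetical script codes standing in for the integer carried by ScriptAttribute.
    static final int LATIN = 0;
    static final int HAN = 1;

    /** Hypothetical token: text, script code, and character offsets. */
    static final class Token {
        final String text;
        final int script;
        final int start;
        final int end;

        Token(String text, int script, int start, int end) {
            this.text = text;
            this.script = script;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Passes non-Han tokens through unchanged and replaces each run of
     * adjacent Han unigrams with overlapping bigrams. A Han token with no
     * adjacent Han neighbour is emitted as a unigram.
     */
    static List<String> bigramHan(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Token t = tokens.get(i);
            if (t.script != HAN) {
                out.add(t.text);
                i++;
                continue;
            }
            // Extend j to the last token of the run of Han tokens whose
            // offsets touch (no whitespace or punctuation in between).
            int j = i;
            while (j + 1 < tokens.size()
                    && tokens.get(j + 1).script == HAN
                    && tokens.get(j + 1).start == tokens.get(j).end) {
                j++;
            }
            if (j == i) {
                out.add(t.text); // isolated Han character
            } else {
                for (int k = i; k < j; k++) {
                    out.add(tokens.get(k).text + tokens.get(k + 1).text);
                }
            }
            i = j + 1;
        }
        return out;
    }
}
```

For example, given the tokens lucene (LATIN) followed by the adjacent Han unigrams 中, 文, 分, 词, bigramHan returns lucene, 中文, 文分, 分词. A real filter would work on the token stream incrementally and set offsets and position increments on the emitted tokens, which this sketch leaves out.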

The reason there isn't one is that I'd really like for us to
eventually solve this more generally with
https://issues.apache.org/jira/browse/LUCENE-2470

But for now, I think it would be good to be practical and add the
explicit filter (we can just mark the API experimental, hoping we will
make it more general with 2470) so people can easily get good
out-of-the-box performance in situations like yours.
