lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2906) Filter to process output of ICUTokenizer and create overlapping bigrams for CJK
Date Sun, 06 Feb 2011 17:44:33 GMT


Robert Muir commented on LUCENE-2906:

How will this differ from the SmartChineseAnalyzer?

The SmartChineseAnalyzer is for Simplified Chinese only... this is about the 
language-independent technique similar to what CJKAnalyzer does today.

I doubt it, but can this be in 3.1?

Well, I hate the way CJKAnalyzer treats things like supplementary characters (wrongly).
This is definitely a bug, and it is fixed here. Part of me wants to fix this as quickly as possible.
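
As a rough illustration of the general class of problem (this is not CJKAnalyzer's actual code, and the class and method names are hypothetical), the sketch below contrasts splitting a Java string by UTF-16 code unit with splitting it by Unicode code point. A supplementary character such as U+20000 is stored as a surrogate pair, so a splitter that works per `char` breaks it into two meaningless halves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Lucene.
class CodePointSplit {
    // Splits by UTF-16 code unit: a supplementary character
    // comes out as two lone surrogates.
    static List<String> byChar(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(String.valueOf(s.charAt(i)));
        }
        return out;
    }

    // Splits by Unicode code point: a surrogate pair stays together.
    static List<String> byCodePoint(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp); // advance 1 or 2 chars
        }
        return out;
    }

    public static void main(String[] args) {
        // U+20000 (CJK Extension B ideograph) followed by U+4E00.
        String text = new String(Character.toChars(0x20000)) + "\u4E00";
        System.out.println(byChar(text).size());      // 3
        System.out.println(byCodePoint(text).size()); // 2
    }
}
```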

At the same time though, I would prefer 3.2... otherwise I would feel like I am rushing things.

I don't think 3.2 needs to come a year after 3.1... in fact, since we have a stable branch,
I think it's stupid to make bugfix releases like 3.1.1 when we could just push out a new
minor version (3.2) with bugfixes instead. The whole branch is intended for stable changes,
so I think this is a better use of our time. But this is just my opinion; we can discuss it
later on the list as one idea to promote more rapid releases.

> Filter to process output of ICUTokenizer and create overlapping bigrams for CJK 
> --------------------------------------------------------------------------------
>                 Key: LUCENE-2906
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Tom Burton-West
>            Priority: Minor
>         Attachments: LUCENE-2906.patch
> The ICUTokenizer produces unigrams for CJK. We would like to use the ICUTokenizer but
> have overlapping bigrams created for CJK, as in the CJKAnalyzer. This filter would take the
> output of the ICUTokenizer, read the ScriptAttribute, and for selected scripts (Han, Kana)
> produce overlapping bigrams.
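
The idea in the issue description can be sketched roughly as follows. This is not the attached patch and does not use the Lucene TokenStream API: `BigramSketch`, `isCjkUnigram`, and `bigrams` are hypothetical names. The sketch assumes that the tokens arrive as plain strings, that consecutive tokens are adjacent in the text, and that the script can be looked up with the JDK's `Character.UnicodeScript` (Java 7+) instead of a ScriptAttribute. What to do with a lone CJK unigram is also an assumption; here it is passed through unchanged.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the LUCENE-2906 patch.
class BigramSketch {
    // True if the token is exactly one code point in Han, Hiragana, or Katakana.
    static boolean isCjkUnigram(String token) {
        if (token.isEmpty() || token.codePointCount(0, token.length()) != 1) {
            return false;
        }
        UnicodeScript script = UnicodeScript.of(token.codePointAt(0));
        return script == UnicodeScript.HAN
            || script == UnicodeScript.HIRAGANA
            || script == UnicodeScript.KATAKANA;
    }

    // Replaces each run of two or more CJK unigrams with overlapping bigrams;
    // all other tokens (and lone CJK unigrams) pass through unchanged.
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (!isCjkUnigram(tokens.get(i))) {
                out.add(tokens.get(i));
                i++;
                continue;
            }
            // Find the end of the current run of CJK unigrams.
            int end = i;
            while (end < tokens.size() && isCjkUnigram(tokens.get(end))) {
                end++;
            }
            if (end - i == 1) {
                out.add(tokens.get(i)); // assumption: keep a lone unigram as-is
            } else {
                for (int j = i; j + 1 < end; j++) {
                    out.add(tokens.get(j) + tokens.get(j + 1));
                }
            }
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // "lucene" followed by the unigrams U+4E2D, U+56FD, U+4EBA.
        System.out.println(bigrams(Arrays.asList("lucene", "\u4E2D", "\u56FD", "\u4EBA")));
    }
}
```

With the input in `main`, the Latin token passes through and the three Han unigrams become the two overlapping bigrams U+4E2D U+56FD and U+56FD U+4EBA. Because the check counts code points rather than chars, a supplementary ideograph held as a surrogate pair is treated as one unigram.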
