lucene-java-user mailing list archives

From Robert Muir
Subject Re: ICUTokenizer and CJK
Date Tue, 23 Nov 2010 11:07:25 GMT
On Mon, Nov 22, 2010 at 6:50 PM, Burton-West, Tom wrote:
> Hi all,
> I see in the javadoc for the ICUTokenizer that it has special handling for Lao, Myanmar,
> Khmer word breaking but no details in the javadoc about what it does with CJK, which for C
> and J appears to be breaking into unigrams. Is this correct?

Yes, Han ideographs are segmented into unigrams, one token per
ideograph (this is the UAX #29 default behavior). I don't know off the
top of my head what the rules are for Japanese kana...
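
To make "unigrams" concrete, here is a minimal sketch in plain Java. It is
not the ICUTokenizer implementation and does not call Lucene or ICU at all;
the class HanUnigramSketch and its segment() method are hypothetical names
made up for illustration. It only shows the shape of the output for Han
text: each ideograph becomes its own token. The handling of everything that
is not Han (grouping letters and digits into runs) is a simplification I am
assuming for the example and says nothing about the real rules for kana or
other scripts.

```java
import java.util.ArrayList;
import java.util.List;

public class HanUnigramSketch {

    // Hypothetical helper, for illustration only: emits every Han code point
    // as a single-character token (a "unigram"). Consecutive non-Han letters
    // and digits are grouped into one token; this part is a simplification
    // and is NOT meant to reproduce UAX #29 for other scripts.
    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean han =
                Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (han || !Character.isLetterOrDigit(cp)) {
                // A Han ideograph or a separator ends the current run.
                if (run.length() > 0) {
                    tokens.add(run.toString());
                    run.setLength(0);
                }
                if (han) {
                    tokens.add(new String(Character.toChars(cp)));
                }
            } else {
                run.appendCodePoint(cp);
            }
        }
        if (run.length() > 0) {
            tokens.add(run.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Lucene" followed by the two ideographs U+4E2D U+6587,
        // written as escapes to keep the source file ASCII-only.
        List<String> tokens = segment("Lucene \u4E2D\u6587");
        System.out.println(tokens.size() + " tokens: " + tokens);
    }
}
```

With that input the sketch produces three tokens: "Lucene" and then each of
the two ideographs on its own, which is the pattern you are seeing for C and
J text.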
