lucene-java-user mailing list archives

From Tomoko Uchida <>
Subject Re: Chinese sorting
Date Thu, 18 Dec 2014 18:16:04 GMT
Yes, sorting Kanji is not as easy as sorting Hiragana/Katakana.

We simply expect collators to sort strings phonetically, regardless of the
script they are written in (Hiragana, Katakana, or Kanji).
However, a Kanji usually has multiple (typically 2 or 3) readings. We humans
naturally judge which reading is suitable depending on the context.
That makes things difficult. An ideal collator would probably have to behave
and judge like a human.
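
To illustrate the point about readings, here is a minimal sketch of one
possible workaround: keep a hand-supplied Hiragana reading next to each name
and sort by that reading instead of by the Kanji. The class ReadingSortSketch,
the Name pair, and the sample names are hypothetical and not part of Lucene or
ICU. Plain string order on kana is also only an approximation of the real
ordering rules (voiced marks, small kana, and so on).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReadingSortSketch {
    // Hypothetical pair: a name as written, plus a reading supplied by hand.
    static final class Name {
        final String written;
        final String reading;

        Name(String written, String reading) {
            this.written = written;
            this.reading = reading;
        }
    }

    // Orders by the written form; for these characters String order equals
    // Unicode code point order.
    static List<String> sortByCodePoint(List<Name> names) {
        return sorted(names, Comparator.comparing((Name n) -> n.written));
    }

    // Orders by the Hiragana reading (assumption: the reading is correct).
    static List<String> sortByReading(List<Name> names) {
        return sorted(names, Comparator.comparing((Name n) -> n.reading));
    }

    private static List<String> sorted(List<Name> names, Comparator<Name> order) {
        List<Name> copy = new ArrayList<>(names);
        copy.sort(order);
        List<String> out = new ArrayList<>();
        for (Name n : copy) {
            out.add(n.written);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Name> names = Arrays.asList(
                new Name("山田", "やまだ"),   // Yamada
                new Name("川田", "かわだ"),   // Kawada
                new Name("田中", "たなか"));  // Tanaka
        System.out.println(sortByCodePoint(names)); // [山田, 川田, 田中]
        System.out.println(sortByReading(names));   // [川田, 田中, 山田]
    }
}
```

The reading has to come from somewhere (user input, a dictionary), which is
exactly the judgement a collator cannot make on its own.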

Sorry for the long preamble.
I have tried ICUCollationKeyAnalyzer with Kanji and found it "not so bad":
very good compared to sorting by Unicode code point, but far from perfect.
I don't fully know the algorithm it uses, but the accuracy probably depends
heavily on the dictionaries/standards it relies on.

(Just an FYI: collators can take custom rules to adjust the order.)
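
A minimal sketch of what rule-based adjustment can look like. To stay
self-contained it uses the JDK's java.text.RuleBasedCollator as a stand-in;
ICU4J has its own com.ibm.icu.text.RuleBasedCollator that also accepts a rule
string, but it is not used here. The rule string is a toy assumption that
orders three Kanji by one common reading each (kawa, ta, yama); a real setup
would more likely extend an existing collator's rules than start from
scratch.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TailoredRulesSketch {
    // Toy rules (assumption): 川 (kawa) before 田 (ta) before 山 (yama).
    // Characters not mentioned in the rules sort after the listed ones.
    static final String RULES = "< 川 < 田 < 山";

    static Collator tailored() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("invalid collation rules", e);
        }
    }

    // Returns a sorted copy; Collator is a Comparator, so it can be passed
    // directly to List.sort.
    static List<String> sort(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(tailored());
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sort(Arrays.asList("山田", "川田", "田中")));
    }
}
```

Rules like these fix one reading per character, so they cannot resolve a Kanji
whose reading changes with context.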


2014-12-18 18:19 GMT+09:00 Nils Knappmeier <>:
> Hi Tomoko,
> does sorting with Locale.JAPANESE also work for Kanji? Since Hiragana and
> Katakana are phonetic scripts, I guess it is easier to define a sorting
> order for them. But Kanji is more similar to Chinese.
> Thanks,
>   Nils
> On 17.12.2014 17:01, Tomoko Uchida wrote:
>> Hi, Nils,
>> I don't know Chinese at all... but collation is very important in Japanese
>> too.
>> Lucene has the org.apache.lucene.collation package, which uses ICU4J's
>> collators (you can find "lucene-analyzers-icu-4.10.2.jar" in the
>> analysis/icu directory).
>> html?org/apache/lucene/collation/package-summary.html
>> ICU4J also supports Chinese, of course.
>> I wrote a test program using ICUCollationKeyAnalyzer, and it works well
>> for Japanese Hiragana/Katakana.
>> Here is a code snippet.
>> // Collator here is com.ibm.icu.text.Collator, not java.text.Collator.
>> Analyzer collationAnalyzer = new ICUCollationKeyAnalyzer(
>>     Version.LUCENE_4_10_2, Collator.getInstance(Locale.JAPANESE));
>> IndexWriter writer = new IndexWriter(
>>     dir, new IndexWriterConfig(Version.LUCENE_4_10_2, collationAnalyzer));
>> I understand collation is a very difficult problem, so I am not sure this
>> will work for you...
>> I would appreciate it if you shared the results of your trials/research.
>> Regards,
>> Tomoko
>> 2014-12-17 20:54 GMT+09:00 Nils Knappmeier <>:
>>> Hi,
>>> is there any implementation of a Chinese collator in Lucene? I've seen
>>> that there is a Chinese analyzer which uses Hidden Markov Models. But
>>> sorting seems to be an issue of its own, and all my googling hasn't led
>>> to any results yet.
>>> I understand that this is not a trivial issue, and I've read that the
>>> Chinese tend to prefer orderings other than by name, since the sorting
>>> orders are so complicated that nobody wants to use them. But we will
>>> have to sort search results by name, even when the name is Chinese
>>> (Simplified Chinese at the moment, but Traditional may also appear
>>> later), and currently Chinese words seem to be ordered by their Unicode
>>> code point, which does not seem to be the right order.
>>> Thanks in advance for any suggestion,
>>>   Nils
