lucene-java-user mailing list archives

From Tomoko Uchida <>
Subject Re: Chinese sorting
Date Wed, 17 Dec 2014 16:01:37 GMT
Hi, Nils,

I don't know Chinese at all... but collation is very important in Japanese.
Lucene has the org.apache.lucene.collation package, which uses ICU4J's collators
(you can find "lucene-analyzers-icu-4.10.2.jar" in the analysis/icu directory).

ICU4J also supports Chinese, of course.

I wrote a test program using ICUCollationKeyAnalyzer, and it works well with
Japanese Hiragana/Katakana.
Here is a code snippet.

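// The constructor call is truncated in the source; ICUCollationKeyAnalyzer is
// named in the text, but the collator argument below is an assumption.
Analyzer collationAnalyzer = new ICUCollationKeyAnalyzer(
    Version.LUCENE_4_10_2, Collator.getInstance(ULocale.JAPANESE));
IndexWriter writer = new IndexWriter(dir,
    new IndexWriterConfig(Version.LUCENE_4_10_2, collationAnalyzer));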
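
To illustrate why a collator matters here, the following is a minimal sketch that contrasts plain code point ordering with collator-based ordering. It deliberately uses the JDK's java.text.Collator as a stand-in for ICU4J's com.ibm.icu.text.Collator so that it runs without extra jars; it is not the test program mentioned above. The class name, the method names and the sample surnames are illustrative assumptions. The last method sorts by precomputed collation keys, which is roughly the idea behind indexing collation keys instead of raw terms.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; the names are illustrative and not part of Lucene or ICU4J.
class CollationSortSketch {

    // Sort with String.compareTo, i.e. by UTF-16 code unit value.
    static List<String> sortByCodePoint(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    // Sort with a locale-aware collator (JDK stand-in for the ICU4J collator).
    static List<String> sortByCollator(List<String> names, Collator collator) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(collator);
        return sorted;
    }

    // Sort by precomputed collation keys; comparing keys is a plain byte-wise
    // comparison, so the collation rules are only evaluated once per string.
    static List<String> sortByCollationKey(List<String> names, Collator collator) {
        List<CollationKey> keys = new ArrayList<>();
        for (String name : names) {
            keys.add(collator.getCollationKey(name));
        }
        Collections.sort(keys);
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        // Sample surnames: Wang, A, Zhang, Li.
        List<String> names = Arrays.asList("王", "阿", "张", "李");
        Collator zh = Collator.getInstance(Locale.SIMPLIFIED_CHINESE);

        // Code point order: [张, 李, 王, 阿]
        System.out.println(sortByCodePoint(names));
        // Collator order depends on the collation data available at runtime.
        System.out.println(sortByCollator(names, zh));
        System.out.println(sortByCollationKey(names, zh));
    }
}
```

A pinyin-style collation would be expected to give 阿, 李, 王, 张 (a, li, wang, zhang), but the actual result depends on the collation data the runtime provides, so it is worth checking the output against ICU4J's Chinese collators before relying on it.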

I understand collation is a very difficult problem, so I am not sure this
works for you...
I would appreciate it if you shared the results of your trials/research.


2014-12-17 20:54 GMT+09:00 Nils Knappmeier <>:
> Hi,
> is there any implementation of a Chinese collator in Lucene? I've seen
> that there is a Chinese analyzer which uses Hidden Markov Models. But
> sorting seems to be an issue of its own, and all my googling hasn't led to
> any results yet.
> I understand that this is not a trivial issue, and I've read that the
> Chinese tend to prefer orderings other than by name, since the sorting
> orders are so complicated that nobody wants to use them. But we will have
> to sort search results by name, even though the name is Chinese (Simplified
> Chinese at the moment, but Traditional may also appear later), and currently
> Chinese words seem to be ordered by their Unicode code point, which does
> not seem to be the right order.
> Thanks in advance for any suggestion,
>  Nils
