lucene-java-user mailing list archives

From Nils Knappmeier <n.knappme...@i-views.de>
Subject Re: Chinese sorting
Date Thu, 18 Dec 2014 09:19:50 GMT
Hi Tomoko,

does sorting with Locale.JAPANESE also work for Kanji? Since Hiragana
and Katakana are phonetic scripts, I guess it is easier to define
a sorting order for them. But Kanji are more similar to Chinese characters.

Thanks,
   Nils

On 17.12.2014 17:01, Tomoko Uchida wrote:
> Hi, Nils,
>
> I don't know Chinese at all... but collation is very important in Japanese
> too.
> Lucene has the org.apache.lucene.collation package, which uses ICU4J's collators
> (you can find "lucene-analyzers-icu-4.10.2.jar" in analysis/icu directory).
> http://lucene.apache.org/core/4_10_2/analyzers-icu/index.html?org/apache/lucene/collation/package-summary.html
>
> ICU4J also supports Chinese, of course.
> http://site.icu-project.org/charts/collation-icu4j-sun
>
> I wrote a test program using ICUCollationKeyAnalyzer; it works well for
> Japanese Hiragana/Katakana.
> Here is a code snippet.
>
> // Collator here is com.ibm.icu.text.Collator (ICU4J), not java.text.Collator.
> Analyzer collationAnalyzer = new ICUCollationKeyAnalyzer(
>     Version.LUCENE_4_10_2, Collator.getInstance(Locale.JAPANESE));
> IndexWriter writer = new IndexWriter(
>     dir, new IndexWriterConfig(Version.LUCENE_4_10_2, collationAnalyzer));
>
> I understand collation is a very difficult problem, so I am not sure this
> works for you...
> I would appreciate it if you shared the results of your trials/research.
>
> Regards,
> Tomoko
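
For readers following the thread: the idea behind indexing collation keys can be sketched with the JDK alone. This is only an illustration under stated assumptions. It uses java.text.Collator as a stand-in for ICU4J, and the class CollationKeyDemo and its helper methods are hypothetical names, not part of Lucene or ICU4J. It shows that comparing the raw key bytes gives the same ordering as asking the collator directly, which is what makes such keys usable as sortable index terms.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationKeyDemo {

    // Returns the collation key of s as raw bytes (most significant byte first).
    static byte[] sortKey(Collator collator, String s) {
        return collator.getCollationKey(s).toByteArray();
    }

    // Lexicographic comparison of two byte arrays, treating bytes as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // JDK collator for Japanese; the data it uses depends on the runtime.
        Collator collator = Collator.getInstance(Locale.JAPANESE);
        String a = "\u3042"; // Hiragana "a"
        String b = "\u304b"; // Hiragana "ka"

        int direct = Integer.signum(collator.compare(a, b));
        int viaKeys = Integer.signum(
                compareUnsigned(sortKey(collator, a), sortKey(collator, b)));

        System.out.println("collator.compare sign: " + direct);
        System.out.println("key byte compare sign: " + viaKeys);
    }
}
```

The two printed signs should agree for any pair of strings, because the key bytes are derived from the same collation rules the collator applies in compare().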
>
> 2014-12-17 20:54 GMT+09:00 Nils Knappmeier <n.knappmeier@i-views.de>:
>> Hi,
>>
>> is there any implementation of a Chinese collator in Lucene? I've seen
>> that there is a Chinese analyzer which uses Hidden Markov Models. But
>> sorting seems to be an issue of its own, and all my googling hasn't led to
>> any results yet.
>>
>> I understand that this is not a trivial issue, and I've read that the
>> Chinese tend to prefer orderings other than by name, since sorting orders
>> are so complicated that nobody wants to use them. But we will have to sort
>> search results by name, even though the name is Chinese (simplified Chinese
>> at the moment, but traditional may also appear later), and currently Chinese
>> words seem to be ordered by their Unicode code points, which does not seem
>> to be the right order.
>>
>> Thanks in advance for any suggestion,
>>   Nils
>>
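
The difference between code-point order and a locale-aware order, as described in the original question, can be sketched with the JDK alone. This is a minimal illustration under stated assumptions: it uses java.text.Collator rather than the ICU4J collator discussed in the thread, the class ChineseNameSort and the four sample surnames are hypothetical, and the order the collator produces depends on the locale data shipped with the Java runtime, so no particular collated result is claimed here.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class ChineseNameSort {

    // Sorts by String natural order, i.e. by UTF-16 code unit value.
    static List<String> sortByCodePoint(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Sorts with the JDK collator for the given locale.
    static List<String> sortByCollator(List<String> names, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> copy = new ArrayList<>(names);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        // Sample surnames, written as escapes to avoid source-encoding issues.
        List<String> names = Arrays.asList(
                "\u738b",  // Wang
                "\u9648",  // Chen
                "\u5f20",  // Zhang
                "\u674e"); // Li

        System.out.println("code point order: " + sortByCodePoint(names));
        System.out.println("collator order:   "
                + sortByCollator(names, Locale.SIMPLIFIED_CHINESE));
    }
}
```

By code point the surnames come out as Zhang (U+5F20), Li (U+674E), Wang (U+738B), Chen (U+9648), which is not alphabetical by pinyin. Whether the collator line shows a pinyin order depends on the collation data available to the runtime.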

