lucene-java-user mailing list archives

From Nils Knappmeier <n.knappme...@i-views.de>
Subject Re: Chinese sorting
Date Fri, 19 Dec 2014 08:16:06 GMT
Hi Tomoko,

thank you for the detailed explanation and many thanks for trying out 
the analyzer for me.
I think "Very good compared to Unicode codepoint based sorting" is good 
enough for me.

I will just try and use that Analyzer and see how it satisfies our customer.

Regards,
   Nils




On 18.12.2014 19:16, Tomoko Uchida wrote:
> Yes, sorting Kanji is not as easy as sorting Hiragana/Katakana.
>
> We simply expect collators to sort strings phonetically, regardless of the
> script they are written in (Hiragana, Katakana, or Kanji).
> However, a Kanji has multiple (usually 2 or 3) readings. We humans naturally
> judge which reading is suitable depending on the context.
> That makes things difficult. Maybe an ideal collator should behave and
> judge like a human.
>
> Sorry for the long preamble.
> I have tried ICUCollationKeyAnalyzer for Kanji and found it "not so bad". Very
> good compared to Unicode codepoint based sorting, but far from perfect.
> I don't fully know the algorithm they use, but the accuracy might depend
> heavily on the dictionaries/standards they have.
>
> (Just an FYI,) Collators can take rules for adjustment.
> http://userguide.icu-project.org/collation/api
>
> Regards,
> Tomoko
>
>
>
>
> 2014-12-18 18:19 GMT+09:00 Nils Knappmeier <n.knappmeier@i-views.de>:
>> Hi Tomoko,
>>
>> does sorting with Locale.JAPANESE also work for Kanji? Since Hiragana and
>> Katakana are based on phonetics, I guess it is easier to define a
>> sorting order for them. But Kanji is more similar to Chinese.
>>
>> Thanks,
>>    Nils
>>
>>
>> On 17.12.2014 17:01, Tomoko Uchida wrote:
>>
>>> Hi, Nils,
>>>
>>> I don't know Chinese at all... but collation is very important in Japanese
>>> too.
>>> Lucene has the org.apache.lucene.collation package, which uses ICU4J's
>>> collators (you can find "lucene-analyzers-icu-4.10.2.jar" in the
>>> analysis/icu directory).
>>> http://lucene.apache.org/core/4_10_2/analyzers-icu/index.html?org/apache/lucene/collation/package-summary.html
>>>
>>> ICU4J also supports Chinese, of course.
>>> http://site.icu-project.org/charts/collation-icu4j-sun
>>>
>>> I wrote a test program using ICUCollationKeyAnalyzer, and it works well
>>> for Japanese Hiragana/Katakana.
>>> Here is a code snippet.
>>>
>>> // Collator here is ICU4J's com.ibm.icu.text.Collator, not java.text.Collator.
>>> Analyzer collationAnalyzer = new ICUCollationKeyAnalyzer(
>>>     Version.LUCENE_4_10_2, Collator.getInstance(Locale.JAPANESE));
>>> IndexWriter writer = new IndexWriter(
>>>     dir, new IndexWriterConfig(Version.LUCENE_4_10_2, collationAnalyzer));
>>>
>>> I understand collation is a very difficult problem, so I am not sure this
>>> will work for you...
>>> I would appreciate it if you shared the results of your trial/research.
>>>
>>> Regards,
>>> Tomoko
>>>
>>> 2014-12-17 20:54 GMT+09:00 Nils Knappmeier <n.knappmeier@i-views.de>:
>>>
>>>> Hi,
>>>>
>>>> is there any implementation of a Chinese collator in Lucene? I've seen
>>>> that there is a Chinese analyzer which uses Hidden Markov Models. But
>>>> sorting seems to be an issue of its own, and all my googling hasn't led to
>>>> any results yet.
>>>>
>>>> I understand that this is not a trivial issue, and I've read that the
>>>> Chinese tend to prefer orderings other than by name, since sorting orders
>>>> are so complicated that nobody wants to use them. But we will have to sort
>>>> search results by name, even though the name is Chinese (Simplified Chinese
>>>> at the moment, but Traditional may also appear later), and currently Chinese
>>>> words seem to be ordered by their Unicode code point, which seems not to be
>>>> the right order.
>>>>
>>>> Thanks in advance for any suggestion,
>>>>    Nils
>>>>
>>>>
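
The thread contrasts Unicode codepoint ordering with collator ordering and
mentions that collators can take rules for adjustment. The sketch below
illustrates that idea only. It deliberately uses the JDK's
java.text.RuleBasedCollator so that it runs without extra jars; it is not
ICU4J's com.ibm.icu.text.RuleBasedCollator, whose rule syntax and defaults
differ. The class name, method names and the six-character rule string are
hypothetical and cover just enough kana for the demo; a real rule set would
be far larger.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KanaRuleSketch {
    // Hypothetical toy rules in java.text.RuleBasedCollator syntax:
    // '<' marks a primary difference, ',' a tertiary difference.
    // Hiragana a (U+3042) and katakana a (U+30A2) differ only at the tertiary
    // level; likewise i (U+3044 / U+30A4) and ka (U+304B / U+30AB).
    static final String TOY_RULES =
        "< \u3042 , \u30A2 < \u3044 , \u30A4 < \u304B , \u30AB";

    // Plain String.compareTo: UTF-16 code unit order, which equals code point
    // order for these BMP characters. All hiragana sort before all katakana.
    static List<String> sortByCodeUnit(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(String::compareTo);
        return copy;
    }

    // Rule-based order: strings are compared by sound first (primary level),
    // and the script only breaks ties (tertiary level).
    static List<String> sortByRules(List<String> names) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(TOY_RULES);
            List<String> copy = new ArrayList<>(names);
            copy.sort(collator); // Collator implements Comparator<Object>
            return copy;
        } catch (ParseException e) {
            throw new IllegalStateException("toy rules failed to parse", e);
        }
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "\u30AB\u30A4",  // "kai" in katakana
            "\u3042\u3044",  // "ai" in hiragana
            "\u304B\u3044",  // "kai" in hiragana
            "\u30A2\u30A4"); // "ai" in katakana
        System.out.println(sortByCodeUnit(names));
        System.out.println(sortByRules(names));
    }
}
```

With these four inputs, code unit order groups the strings by script
(あい, かい, アイ, カイ), while the toy rules group them by sound
(あい, アイ, かい, カイ). Kanji is harder precisely because, as discussed
above, one character can have several readings, so no simple per-character
rule can capture the phonetic order.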

