lucene-solr-user mailing list archives

From "Eyal Naamati" <Eyal.Naam...@exlibrisgroup.com>
Subject RE: ICUTransformFilter with traditional to simplified Chinese
Date Tue, 19 Dec 2017 14:06:47 GMT
Thanks!
I actually did read the Stanford posts when we implemented our index; they were very helpful!

-----Original Message-----
From: Shawn Heisey [mailto:apache@elyograg.org] 
Sent: Tuesday, December 19, 2017 1:31 AM
To: solr-user@lucene.apache.org
Subject: Re: ICUTransformFilter with traditional to simplified Chinese

On 12/18/2017 9:49 AM, Eyal Naamati wrote:
> We are using the ICUTransformFilter to normalize traditional Chinese text to simplified Chinese.
> We received feedback from some of our Chinese customers that there are some traditional characters that are not converted to their simplified variants.
> For example:
> "眞" should be converted to "真"
> "硏" should be converted to "研"
> "夲" should be converted to "本"
>
> Does anyone know if this is indeed a problem with the filter?
> Or if there are other options to use instead of this filter that handle more characters?

I have one index for a website we built for a customer in Japan.  While researching how to
effectively handle CJK characters, I came across an entire series of blog posts.  Here's
the first post; you can check the same blog for more posts on the same subject.
There are a lot of them:

http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html

One of the filters that Stanford utilized (and we also implemented) is a custom filter that
they wrote, apparently specifically because there are things that the ICU filters included
with Lucene do not catch.

https://github.com/sul-dlss/CJKFoldingFilter

Looking into the code for the custom filter and checking into your first example, this filter
actually seems to go in the reverse direction -- it converts 真 to 眞.  I did not look
into the other examples, and I'm completely clueless about CJK characters, so I don't know
what those characters are or what the correct action would be.

That third-party custom filter would probably be helpful to you.  Even though it goes in
the reverse direction for your first example, as long as the behavior at index time and query
time is the same, you should still get matches.  End users would most likely never see the
results of the analysis.
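
To illustrate why the direction of the folding does not matter, here is a minimal Python
sketch.  It is not Solr or CJKFoldingFilter code: the one-entry folding table and the names
FOLD, analyze, build_index and search are all made up for illustration, and the only
assumption carried over from the thread is that 真 is folded to 眞.

```python
# Hypothetical one-entry folding table (variant -> folded form).
# The real filter's table is far larger and is not reproduced here.
FOLD = str.maketrans({"真": "眞"})


def analyze(text):
    """Stand-in for an analysis chain: split on whitespace, then fold each token."""
    return [token.translate(FOLD) for token in text.split()]


def build_index(docs):
    """Build a tiny inverted index: folded term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Run the query through the same analysis used at index time."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


docs = {1: "眞", 2: "真", 3: "本"}
index = build_index(docs)

# Both spellings are folded to the same term, so either query finds both documents.
print(search(index, "真"))
print(search(index, "眞"))
```

Because the same analyze step runs on documents and on queries, both variants land on the
same indexed term.  The stored text is untouched; only the indexed terms are folded.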

Whether or not the behavior you've noticed is a bug with ICUTransformFilter is a question
that I cannot answer.  If it is, then the bug will be in ICU, not Lucene.

http://lucene.apache.org/core/7_1_0/analyzers-icu/org/apache/lucene/analysis/icu/ICUTransformFilter.html

Thanks,
Shawn
