lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Susheel Kumar <susheel2...@gmail.com>
Subject Re: Question regarding searching Chinese characters
Date Fri, 20 Jul 2018 12:59:35 GMT
Have you tried to use CJKFoldingFilter
https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
cover your use case but I am using this filter and so far no issues.

Thnx

On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shuman@gmail.com>
wrote:

> Thanks, Alex - I have seen a few of those links but never considered
> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> basically what is laid out in the old blogspot post, namely this point:
>
>
> "Why approach CJK resource discovery differently?
>
> 2.  Search results must be as script agnostic as possible.
>
> There is more than one way to write each word. "Simplified" characters were
> emphasized for printed materials in mainland China starting in the 1950s;
> "Traditional" characters were used in printed materials prior to the 1950s,
> and are still used in Taiwan, Hong Kong and Macau today.
> Since the characters are distinct, it's as if Chinese materials are written
> in two scripts.
> Another way to think about it:  every written Chinese word has at least two
> completely different spellings.  And it can be mix-n-match:  a word can be
> written with one traditional  and one simplified character.
> Example:   Given a user query 舊小說  (traditional for old fiction), the
> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
> characters for old fiction)"
>
> So, using the example provided above, we are dealing with materials
> produced in the 1950s-1970s that do even weirder things like:
>
> A. 舊小說
>
> can also be
>
> B. 旧小说 (all simplified)
> or
> C. 旧小說 (first character simplified, last character traditional)
> or
> D. 舊小 说 (first character traditional, last character simplified)
>
> Thankfully the middle character was never simplified in recent times.
>
> From a historical standpoint, the mixed nature of the characters in the
> same word/phrase is because not all simplified characters were adopted at
> the same time by everyone uniformly (good times...).
>
> The problem seems to be that Solr can easily handle A or B above, but NOT C
> or D using the Smart Chinese analyzer. I'm not really sure how to change
> that at this point... maybe I should figure out how to contact the creators
> of the analyzer and ask them?
>
> Amanda
>
> ------
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> <http://www.maoistlegacy.uni-freiburg.de/>
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafalov@gmail.com
> >
> wrote:
>
> > This is probably your start, if not read already:
> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >
> > Otherwise, I think your answer would be somewhere around using ICU4J,
> > IBM's library for dealing with Unicode: http://site.icu-project.org/
> > (mentioned on the same page above)
> > Specifically, transformations:
> > http://userguide.icu-project.org/transforms/general
> >
> > With that, maybe you map both alphabets into latin. I did that once
> > for Thai for a demo:
> > https://github.com/arafalov/solr-thai-test/blob/master/
> > collection1/conf/schema.xml#L34
> >
> > The challenge is to figure out all the magic rules for that. You'd
> > have to dig through the ICU documentation and other web pages. I found
> > this one for example:
> > http://avajava.com/tutorials/lessons/what-are-the-system-
> > transliterators-available-with-icu4j.html;jsessionid=
> > BEAB0AF05A588B97B8A2393054D908C0
> >
> > There is also 12 part series on Solr and Asian text processing, though
> > it is a bit old now: http://discovery-grindstone.blogspot.com/
> >
> > Hope one of these things help.
> >
> > Regards,
> >    Alex.
> >
> >
> > On 20 July 2018 at 03:54, Amanda Shuman <amanda.shuman@gmail.com> wrote:
> > > Hi all,
> > >
> > > We have a problem. Some of our historical documents have mixed together
> > > simplified and Chinese characters. There seems to be no problem when
> > > searching either traditional or simplified separately - that is, if a
> > > particular string/phrase is all in traditional or simplified, it finds
> > it -
> > > but it does not find the string/phrase if the two different characters
> > (one
> > > traditional, one simplified) are mixed together in the SAME
> > string/phrase.
> > >
> > > Has anyone ever handled this problem before? I know some libraries seem
> > to
> > > have implemented something that seems to be able to handle this, but
> I'm
> > > not sure how they did so!
> > >
> > > Amanda
> > > ------
> > > Dr. Amanda Shuman
> > > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > > <http://www.maoistlegacy.uni-freiburg.de/>
> > > PhD, University of California, Santa Cruz
> > > http://www.amandashuman.net/
> > > http://www.prchistoryresources.org/
> > > Office: +49 (0) 761 203 4925
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message