lucene-solr-user mailing list archives

From Amanda Shuman <>
Subject Re: Question regarding searching Chinese characters
Date Fri, 20 Jul 2018 12:44:06 GMT
Thanks, Alex - I have seen a few of those links but never considered
transliteration! We use Lucene's Smart Chinese analyzer. The issue is
basically what is laid out in the old blogspot post, namely this point:

"Why approach CJK resource discovery differently?

2.  Search results must be as script agnostic as possible.

There is more than one way to write each word. "Simplified" characters were
emphasized for printed materials in mainland China starting in the 1950s;
"Traditional" characters were used in printed materials prior to the 1950s,
and are still used in Taiwan, Hong Kong and Macau today.
Since the characters are distinct, it's as if Chinese materials are written
in two scripts.
Another way to think about it:  every written Chinese word has at least two
completely different spellings.  And it can be mix-n-match:  a word can be
written with one traditional  and one simplified character.
Example:   Given a user query 舊小說  (traditional for old fiction), the
results should include matches for 舊小說 (traditional) and 旧小说 (simplified
characters for old fiction)"

So, using the example provided above, we are dealing with materials
produced in the 1950s-1970s that do even weirder things like:

A. 舊小說

can also be

B. 旧小说 (all simplified)
C. 旧小說 (first character simplified, last character traditional)
D. 舊小说 (first character traditional, last character simplified)

Thankfully the middle character was never simplified in recent times.

From a historical standpoint, the mixed nature of the characters in the
same word/phrase is because not all simplified characters were adopted at
the same time by everyone uniformly (good times...).

The problem seems to be that Solr can easily handle A or B above, but NOT C
or D using the Smart Chinese analyzer. I'm not really sure how to change
that at this point... maybe I should figure out how to contact the creators
of the analyzer and ask them?
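
One possible direction, sketched under assumptions rather than tested against Solr: fold every traditional character to its simplified form at both index time and query time, so that variants A through D all collapse to the same string. The two-entry mapping table and the function name fold_to_simplified below are hypothetical and exist only for illustration; a real setup would need a complete mapping, such as the one behind ICU's "Traditional-Simplified" transform, and how that interacts with the Smart Chinese tokenizer would still have to be checked.

```python
# Hypothetical, hand-written mapping for illustration only. It covers just
# the two characters from the example; a real index would need a complete
# traditional-to-simplified table.
TRAD_TO_SIMP = {"舊": "旧", "說": "说"}


def fold_to_simplified(text: str) -> str:
    """Replace each traditional character that has a mapping; keep the rest."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


# Variants A, B, C and D from the message above.
variants = ["舊小說", "旧小说", "旧小說", "舊小说"]

# Folding is applied to documents and to queries alike, so every variant
# ends up as the same indexed/query string.
folded = {fold_to_simplified(v) for v in variants}
print(folded)  # {'旧小说'}
```

Because the same folding runs on both sides, a query typed in any of the four forms would match a document written in any of the four forms. The trade-off is that the index no longer distinguishes which script a document actually used.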


Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project
PhD, University of California, Santa Cruz
Office: +49 (0) 761 203 4925

On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <>

> This is probably your start, if not read already:
> Otherwise, I think your answer would be somewhere around using ICU4J,
> IBM's library for dealing with Unicode:
> (mentioned on the same page above)
> Specifically, transformations:
> With that, maybe you map both alphabets into Latin. I did that once
> for Thai for a demo:
> collection1/conf/schema.xml#L34
> The challenge is to figure out all the magic rules for that. You'd
> have to dig through the ICU documentation and other web pages. I found
> this one for example:
> transliterators-available-with-icu4j.html;jsessionid=
> BEAB0AF05A588B97B8A2393054D908C0
> There is also a 12-part series on Solr and Asian text processing, though
> it is a bit old now:
> Hope one of these things helps.
> Regards,
>    Alex.
> On 20 July 2018 at 03:54, Amanda Shuman <> wrote:
> > Hi all,
> >
> > We have a problem. Some of our historical documents have mixed together
> > simplified and traditional Chinese characters. There seems to be no
> > problem when searching either traditional or simplified separately -
> > that is, if a particular string/phrase is all in traditional or all in
> > simplified, it finds it - but it does not find the string/phrase if the
> > two different characters (one traditional, one simplified) are mixed
> > together in the SAME string/phrase.
> >
> > Has anyone ever handled this problem before? I know some libraries seem
> > to have implemented something that seems to be able to handle this, but
> > I'm not sure how they did so!
> >
> > Amanda
> > ------
> > Dr. Amanda Shuman
> > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > PhD, University of California, Santa Cruz
> >
> >
> > Office: +49 (0) 761 203 4925
