lucene-java-user mailing list archives

From Wolf-Dietrich.Mate...@empolis.com
Subject AW: Do we really need CJKAnalyzer to search japanese characters
Date Tue, 29 Jun 2004 08:21:47 GMT
Hello,

Praveen Peddi [mailto:ppeddi@contextmedia.com] wrote:
> You will have to excuse me if the question looks dumb ;)
Note: I don't speak Chinese or Japanese at all, but I talked
with experts about this topic some time ago.

> I didn't use CJKAnalyzer and I could still search japanese characters.
> Actually I used it first but then I thought of testing with
> just the standard analyzer. It worked with standard analyzer also.
Usually there are no word delimiters in Japanese and Chinese text,
so tokenization is more difficult than in European languages.

> I was able to search the metadata of our objects that has
> chinese and japanese characters.
Maybe the metadata contains spaces or other delimiters, so the standard
analyzer could split it into words. But this is an exception.

> I think lucene is internally storing unicode characters. So
> should it matter if its standard analyzer or CJK analyzer?
The problem is how to split a Chinese or Japanese text into words, e.g.
本初子午綫 is a Chinese word and means "prime meridian".
布赖斯峡谷国家公园 is a phrase meaning "Bryce Canyon National Park";
it should be split into: 布赖斯 (Bryce), 峡谷 (canyon), 国家 (national)
and 公园 (park).

> When do we have to use CJKAnalyzer really?
You need it for both languages (unless you are using a better solution).
A Chinese (or Japanese kanji) character can be a single word or part of
one. The meaning depends on the context: there is a kanji that means
"tree" (木); the same glyph doubled forms a character meaning "grove" (林),
and tripled, one meaning "forest"/"wood" (森).
(I hope I remember correctly.) Tokenization is not an easy task.

There are tools, e.g. the "linguistic platform" by Inxight, that can
perform word segmentation by analysing the text using knowledge about
grammar, but they are commercial.

A simpler solution is to split the input into overlapping sequences of
two and/or three characters. Note this is much better than splitting it
into single characters, because a single character could be part of a
word with a different meaning, and you would also lose too much
precision. Imagine what happens if you split an English text into single
letters and search for letters instead of words.
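The two-character splitting described above can be sketched in a few lines of plain Java. This is only a toy illustration of the bigram idea, not CJKAnalyzer's actual code (the class and method names here are made up for the example):

```java
import java.util.ArrayList;
import java.util.List;

// Naive CJK bigram splitter: emit every overlapping pair of
// adjacent characters, the same basic idea a bigram analyzer uses.
public class BigramSplitter {
    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // 布赖斯峡谷国家公园 ->
        // [布赖, 赖斯, 斯峡, 峡谷, 谷国, 国家, 家公, 公园]
        System.out.println(bigrams("布赖斯峡谷国家公园"));
    }
}
```

A query for 峡谷 (canyon) then matches the bigram 峡谷 directly. Some of the emitted pairs (e.g. 谷国) cross word boundaries and hurt precision a little, but far less than indexing single characters would.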
CJKAnalyzer is probably not perfect, but its developer is a native
Chinese speaker, so I'm sure he knows how to deal with these problems.
Regards,
	Wolf-Dietrich Materna

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

