lucene-java-user mailing list archives

From Roy <royde...@gmail.com>
Subject Re: AW: Do we really need CJKAnalyzer to search japanese characters
Date Wed, 30 Jun 2004 17:20:06 GMT
Tokenizing Chinese and Japanese is not an easy job at all. Bigram
tokenization is really a poor man's solution, but its performance is
fairly good: according to a paper I read, it gives decent precision and
recall. The cons are that you will have some meaningless tokens in your
index, which can hurt precision, and that your index consumes more
space.
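
In case it is useful, here is a minimal sketch of what a bigram
tokenizer does (plain pre-generics Java, no Lucene dependency; the
class and method names are just for illustration):

import java.util.ArrayList;
import java.util.List;

public class BigramDemo {
    // Emit every overlapping pair of adjacent characters:
    // "ABCD" -> [AB, BC, CD]. This is all a bigram tokenizer does.
    static List bigrams(String text) {
        List tokens = new ArrayList();
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Nine characters yield eight bigrams; the pairs that straddle
        // word boundaries are the meaningless tokens mentioned above.
        System.out.println(bigrams("布赖斯峡谷国家公园"));
    }
}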

On Tue, 29 Jun 2004 10:21:47 +0200, wolf-dietrich.materna@empolis.com
<wolf-dietrich.materna@empolis.com> wrote:
> Hello,
> 
> Praveen Peddi [mailto:ppeddi@contextmedia.com] wrote:
> > You will have to excuse me if the question looks dumb ;)
> Note: I don't speak Chinese or Japanese at all, but I talked
> with experts about this topic some time ago.
> 
> > I didn't use CJKAnalyzer and I could still search japanese characters.
> > Actually I used it first but then I thought of testing with
> > just the standard analyzer. It worked with standard analyzer also.
> Usually there are no word delimiters in Japanese and Chinese text,
> so tokenization is more difficult than for European languages.
> 
> > I was able to search the metadata of our objects that has
> > chinese and japanese characters.
> Maybe the metadata contains spaces or other delimiters, so the standard
> analyser could split it into words. But this is an exception.
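> 
> To check what the standard analyzer really did with your metadata, you
> can dump its token stream. An untested sketch against the Lucene 1.4
> API (Analyzer.tokenStream, Token.termText); as far as I know the
> StandardTokenizer grammar emits each CJK character as a token of its
> own, which would also explain why your searches still matched:
> 
> import java.io.StringReader;
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> 
> public class StandardDump {
>     public static void main(String[] args) throws Exception {
>         TokenStream ts = new StandardAnalyzer()
>                 .tokenStream("field", new StringReader("本初子午线 test"));
>         for (Token t = ts.next(); t != null; t = ts.next())
>             System.out.println(t.termText());
>         // expected: 本, 初, 子, 午, 线 one per line, then "test"
>     }
> }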
> 
> > I think lucene is internally storing unicode characters. So
> > should it matter if its standard analyzer or CJK analyzer?
> The problem is how to split a Chinese or Japanese text into words, e.g.
> 本初子午线 is a Chinese word and means "prime meridian".
> 布赖斯峡谷国家公园 is a phrase meaning "Bryce Canyon National Park";
> it should be split into: 布赖斯 (Bryce, a transliteration), 峡谷
> (canyon), 国家 (national) and 公园 (park).
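> 
> The same kind of dump with CJKAnalyzer (it lives in the sandbox;
> package org.apache.lucene.analysis.cjk, if I remember the location
> correctly) shows what you actually get instead of that word split:
> 
> import java.io.StringReader;
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> 
> public class CJKDump {
>     public static void main(String[] args) throws Exception {
>         TokenStream ts = new CJKAnalyzer()
>                 .tokenStream("field", new StringReader("布赖斯峡谷国家公园"));
>         for (Token t = ts.next(); t != null; t = ts.next())
>             System.out.print(t.termText() + " ");
>         // 布赖 赖斯 斯峡 峡谷 谷国 国家 家公 公园: overlapping
>         // bigrams, not the word split above. 斯峡 and 谷国 are not
>         // real words, but 峡谷, 国家 and 公园 all survive, so such
>         // queries still match.
>     }
> }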
> 
> > When do we have to use CJKAnalyzer really?
> You need it for both languages (unless you are using a better solution).
> A Chinese (or Japanese kanji) character can be a single word or part of
> one, and its meaning depends on the context: the character 木 means
> "tree", doubled it forms 林 ("grove") and tripled it forms 森
> ("forest"/"wood"). Tokenization is not an easy task.
> 
> There are tools, e.g. the "linguistic platform" by Inxight, that can
> perform word segmentation by analysing the text with knowledge of the
> grammar, but they are commercial.
> 
> A simpler solution is to split the input into overlapping sequences of
> two and/or three characters (see the sketch at the end of this mail).
> Note this is much better than splitting into single characters, because
> a single character can be part of a word with a quite different
> meaning, and you lose too much precision as well. Imagine what happens
> if you split an English text into single letters and search for letters
> instead of words.
> CJKAnalyzer is probably not perfect, but its developer is a native
> Chinese speaker, so I'm sure he knows how to deal with these problems.
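> 
> And the little sketch mentioned above, to make the comparison with
> single characters concrete (toy Java; n is the sequence length):
> 
> public class NGramDemo {
>     // Overlapping n-grams: n = 1 gives the single-character split
>     // criticised above, n = 2 the pairs a bigram analyzer produces.
>     static java.util.List ngrams(String s, int n) {
>         java.util.List out = new java.util.ArrayList();
>         for (int i = 0; i + n <= s.length(); i++)
>             out.add(s.substring(i, i + n));
>         return out;
>     }
>     public static void main(String[] args) {
>         String park = "布赖斯峡谷国家公园";
>         System.out.println(ngrams(park, 1)); // 9 ambiguous "letters"
>         System.out.println(ngrams(park, 2)); // 8 pairs
>         System.out.println(ngrams(park, 3)); // 7 triples
>     }
> }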
> Regards,
>         Wolf-Dietrich Materna
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

