lucene-java-user mailing list archives

From Jerome Lanneluc <jerome_lanne...@fr.ibm.com>
Subject Re: Chinese analyzer
Date Thu, 24 Jan 2013 15:53:32 GMT
It looks like my attachment was lost. It referred to 
org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer.

I'm inlining it here:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ChineseTokenizerTest {
        public static void main(String[] args) throws IOException {
                // "我" (I) "是" (am) "中国" "人" (Chinese = people of China)
                tokenizeChineseWords("我是中国人");
                // Second sample: the surrogate pair (55401 57046), i.e. U+2A6D6.
                // It was reduced to "?" when this message was archived.
                tokenizeChineseWords("\uD869\uDED6");
        }

        private static void tokenizeChineseWords(String chineseWords) throws IOException {
                SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_36);
                TokenStream tokenizer = analyzer.tokenStream(null /* field name */, new StringReader(chineseWords));
                // The attribute instance is reused for every token, so fetch it once.
                CharTermAttribute charTermAttribute = tokenizer.getAttribute(CharTermAttribute.class);
                System.out.print("Sentence: ");
                print(chineseWords);
                System.out.println();
                System.out.print("Tokens: [");
                tokenizer.reset(); // part of the TokenStream consumer workflow
                while (tokenizer.incrementToken()) {
                        print(charTermAttribute);
                        System.out.print(" ");
                }
                tokenizer.end();
                tokenizer.close();
                System.out.println("]");
                System.out.println();
        }

        // Prints the text followed by the decimal value of each UTF-16 code unit.
        private static void print(CharSequence text) {
                System.out.print(text);
                System.out.print("(");
                for (int i = 0, length = text.length(); i < length; i++) {
                        System.out.print((int) text.charAt(i));
                        if (i < length - 1)
                                System.out.print(" ");
                }
                System.out.print(")");
        }
}



From:   Robert Muir <rcmuir@gmail.com>
To:     java-user@lucene.apache.org
Date:   01/24/2013 04:31 PM
Subject:        Re: Chinese analyzer



On Thu, Jan 24, 2013 at 9:25 AM, Jerome Lanneluc
<jerome_lanneluc@fr.ibm.com> wrote:
> Note the 2 tokens in the second sample when I would expect to have only
> one token with the (55401 57046) characters.
>
> I could not figure out if I'm doing something wrong, or if this is a bug
> in the Chinese analyzer.
>

Which analyzer specifically? There is more than one...
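
For readers following the thread: the two values quoted above, 55401 and 57046, are UTF-16 code units (0xD869 and 0xDED6) that together encode a single supplementary code point, which is why a single token would be expected. The sketch below only illustrates that with plain java.lang calls; it does not use Lucene, and the class name SurrogatePairDemo and the helper combine are made up for this illustration.

```java
public class SurrogatePairDemo {
    /**
     * Combines two UTF-16 code units into one code point.
     * Returns -1 if the units do not form a valid high/low surrogate pair.
     * (Hypothetical helper, not part of the code posted in the thread.)
     */
    static int combine(int highUnit, int lowUnit) {
        char high = (char) highUnit;
        char low = (char) lowUnit;
        if (!Character.isSurrogatePair(high, low)) {
            return -1;
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        // The two code units reported for the second sample.
        String s = new String(new char[] { (char) 55401, (char) 57046 });

        // length() counts UTF-16 code units, codePointCount() counts characters.
        System.out.println("UTF-16 code units: " + s.length());
        System.out.println("Code points: " + s.codePointCount(0, s.length()));
        System.out.println("Code point: U+"
                + Integer.toHexString(combine(55401, 57046)).toUpperCase());
    }
}
```

Code that walks a string one char at a time sees two units here, while code that walks it by code point sees one character. Whether that is what happens inside the analyzer is the open question in this thread.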
