lucene-solr-user mailing list archives

From "Xuesong Luo" <>
Subject RE: add CJKTokenizer to solr
Date Fri, 22 Jun 2007 06:54:37 GMT
Thanks, Toru and Chris,
I tried both the CJKTokenizer and the CJKAnalyzer. Both return some unexpected highlight results
when I tested with German. The field value I searched is "Ein Mann beißt den Hund". The
search term is beißt.

When using CJKAnalyzer, beißt is treated as 2 separate terms (bei and ß), and the highlight result is:
<str>Ein Mann <em>bei</em><em>ß</em>t den Hund</str>

When using CJKTokenizer, beißt is treated as 3 separate terms (bei, ß and t), and the result is:
<str>Ein Mann <em>bei</em><em>ß</em><em>t</em> den Hund</str>

When using the standard tokenizer, beißt is treated as one word, and the result is:
<str>Ein Mann <em>beißt</em> den Hund</str>

I understand why the standard tokenizer treats beißt as a word, but I don't know how CJKAnalyzer
and CJKTokenizer work. Could anyone explain a little bit?
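
My guess, sketched below, is that the CJK tokenizer keeps only Basic Latin letters and digits together and emits every other letter as a token of its own. This is a simplified stand-in, not the real CJKTokenizer: the class name LatinSplitSketch and the method split are made up for illustration, and the sketch ignores lowercasing and the way real CJK text is paired into bigrams.

```java
import java.util.ArrayList;
import java.util.List;

class LatinSplitSketch {

    // Assumption: only Basic Latin letters and digits extend a "word" token;
    // any other letter (such as U+00DF, the sharp s) is emitted on its own.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean basicLatin =
                Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN;
            if (basicLatin && Character.isLetterOrDigit(c)) {
                word.append(c);          // extend the current ASCII run
                continue;
            }
            if (word.length() > 0) {     // any other character ends the run
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (Character.isLetter(c)) { // non-ASCII letter becomes its own token
                tokens.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {         // flush the last run
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u00DF is the sharp s, written as an escape to avoid encoding issues
        System.out.println(split("Ein Mann bei\u00DFt den Hund"));
    }
}
```

With the field value above, this sketch splits beißt into bei, ß and t, which matches the three highlighted terms I see with CJKTokenizer. If CJKAnalyzer additionally drops the single letter t as a stop word, that might account for its two-term result, but I have not confirmed this.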


-----Original Message-----
From: Toru Matsuzawa [] 
Sent: Monday, June 18, 2007 10:29 PM
Subject: Re: add CJKTokenizer to solr

I'm sorry. The attachment did not come through,
so I am sending it again.

> > I got the error below after adding CJKTokenizer to schema.xml.  I
> > checked the constructor of CJKTokenizer; it requires a Reader parameter,
> > which I guess is why I get this error. I searched the email archive, and it
> > seems to work for other users. Does anyone know what the problem is?
> The CJKTokenizerFactory that I am using is attached.
package org.apache.solr.analysis.ja;

import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

/**
 * CJKTokenizer for Solr
 * @see org.apache.lucene.analysis.cjk.CJKTokenizer
 * @author matsu
 */
public class CJKTokenizerFactory extends BaseTokenizerFactory {

  /**
   * @see org.apache.solr.analysis.TokenizerFactory#create(Reader)
   */
  public TokenStream create(Reader input) {
    return new CJKTokenizer( input );
  }

}


Toru Matsuzawa
