lucene-java-user mailing list archives

From "Koji Sekiguchi" <koji.sekigu...@m4.dion.ne.jp>
Subject RE: korean and lucene
Date Thu, 27 Oct 2005 00:48:41 GMT
Hi Youngho,

With regard to Japanese, using StandardAnalyzer,
I can search a word/phrase.

Did you use QueryParser? StandardAnalyzer tokenizes
CJK characters into a stream of single characters.
Use QueryParser to get a PhraseQuery and search with that query.

Please see the following sample code. Replace the Japanese
"contents" and the search target "phrase" with Korean text in the program and run it.

regards,

Koji

=============================================
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;

public class JapaneseByStandardAnalyzer {

    private static final String FIELD_CONTENT = "content";
    private static final String[] contents = {
	"東京にはおいしいラーメン屋がたくさんあります。", // "There are many good ramen shops in Tokyo."
	"北海道にもおいしいラーメン屋があります。"        // "There are good ramen shops in Hokkaido too."
    };
    private static final String phrase = "ラーメン屋"; // "ramen shop"
    //private static final String phrase = "屋";       // "shop" (single character)
    private static Analyzer analyzer = null;

    public static void main( String[] args ) throws IOException, ParseException {
	Directory directory = makeIndex();
	search( directory );
	directory.close();
    }

    private static Analyzer getAnalyzer(){
	if( analyzer == null ){
	    analyzer = new StandardAnalyzer();
	    //analyzer = new CJKAnalyzer();
	}
	return analyzer;
    }

    private static Directory makeIndex() throws IOException {
	Directory directory = new RAMDirectory();
	IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
	for( int i = 0; i < contents.length; i++ ){
	    Document doc = new Document();
	    doc.add( new Field( FIELD_CONTENT, contents[i], Field.Store.YES, Field.Index.TOKENIZED ) );
	    writer.addDocument( doc );
	}
	writer.close();
	return directory;
    }

    private static void search( Directory directory ) throws IOException, ParseException {
	IndexSearcher searcher = new IndexSearcher( directory );
	QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
	Query query = parser.parse( phrase );
	System.out.println( "query = " + query );
	Hits hits = searcher.search( query );
	for( int i = 0; i < hits.length(); i++ )
	    System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
	searcher.close();
    }
}
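To see why the parsed query comes out as a phrase, note that StandardAnalyzer emits one token per CJK character, so QueryParser turns the consecutive single-character tokens into a PhraseQuery. A rough, Lucene-free sketch of that per-character split (the class and method names here are illustrative, not Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

/** Rough illustration (no Lucene required): for CJK input,
 *  StandardAnalyzer emits one token per character, so a query
 *  string like "ラーメン屋" becomes several single-character
 *  tokens, which QueryParser then wraps into a PhraseQuery. */
public class CjkSplitSketch {

    /** Split a string into single-character tokens, mimicking
     *  the analyzer's per-character treatment of CJK text. */
    public static List<String> singleCharTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the five single-character tokens of the sample phrase.
        System.out.println(singleCharTokens("ラーメン屋"));
    }
}
```

This is only a sketch of the tokenization idea; the real analyzer also handles whitespace, punctuation, and non-CJK runs differently.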


> -----Original Message-----
> From: Youngho Cho [mailto:youngho@nannet.co.kr]
> Sent: Thursday, October 27, 2005 8:18 AM
> To: java-user@lucene.apache.org; Cheolgoo Kang
> Subject: Re: korean and lucene
> 
> 
> Hello Cheolgoo,
> 
> Now I have updated my Lucene version to 1.9 to use StandardAnalyzer 
> for Korean,
> and tested your patch, which has already been adopted in 1.9:
> 
> http://issues.apache.org/jira/browse/LUCENE-444
> 
> But I still have no good results with Korean compared with CJKAnalyzer.
> 
> A single character matches fine, but words of two or more 
> characters don't match at all.
> 
> Am I missing something, or is more work still needed?
> 
> 
> Thanks,
> 
> Youngho.
>  
> 
> ----- Original Message ----- 
> From: "Cheolgoo Kang" <appler@gmail.com>
> To: <java-user@lucene.apache.org>; "John Wang" <john.wang@gmail.com>
> Sent: Tuesday, October 04, 2005 10:11 AM
> Subject: Re: korean and lucene
> 
> 
> > StandardAnalyzer's JavaCC-based StandardTokenizer.jj cannot read
> > the Korean part of the Unicode character blocks.
> > 
> > You should either 1) use CJKAnalyzer, or 2) add the Korean character
> > block (0xAC00~0xD7AF) to the CJK token definition in the
> > StandardTokenizer.jj file.
> > 
> > Hope it helps.
> > 
> > 
> > On 10/4/05, John Wang <john.wang@gmail.com> wrote:
> > > Hi:
> > >
> > > We are running into problems with searching on Korean documents. We are
> > > using the StandardAnalyzer, and everything works with Chinese and Japanese.
> > > Are there known problems with Korean in Lucene?
> > >
> > > Thanks
> > >
> > > -John
> > >
> > >
> > 
> > 
> > --
> > Cheolgoo
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org




