lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Youngho Cho" <youn...@nannet.co.kr>
Subject Re: korean and lucene
Date Tue, 08 Nov 2005 08:49:23 GMT
Hello,

just simple test ...
If I compile the javacc correctly..
the patched version doesn't match some situation 
for example 
in text
'엔진박지성(맨체스터 유나이티드)이 주말 프리미어리그를 위해 벤치를
지키며 재충전의 시간을 가졌다.'
if query word is '시간'   than nothing match
but if query word is '시간을'  than good match.

I think there is some tradeoff here.

Maybe need some good stop filter for korean etc.......


Thanks

Youngho

----- Original Message ----- 
From: "Youngho Cho" <youngho@nannet.co.kr>
To: <java-user@lucene.apache.org>
Sent: Tuesday, November 08, 2005 4:44 PM
Subject: Re: korean and lucene


> Hello Cheolgoo,
> 
> I will test the patch.
> 
> 
> Thanks,
> 
> Youngho
> 
> ----- Original Message ----- 
> From: "Cheolgoo Kang" <appler@gmail.com>
> To: <java-user@lucene.apache.org>; "Youngho Cho" <youngho@nannet.co.kr>
> Sent: Tuesday, November 08, 2005 4:06 PM
> Subject: Re: korean and lucene
> 
> 
> > Hello,
> > 
> > I've created a new JIRA issue with Korean analysis that
> > StandardAnalyzer splits one word into several tokens each with one
> > character. Cause Korean is not a phonogram, one character in Korean
> > has almost no meaning at all. So word in Korean should be preserved
> > not like Chinese or Japanese.
> > 
> > I've attached a patch to StandardTokenizer to do this and passed the
> > test case TestStandardAnalyzer(also patched to test Korean words).
> > 
> > Thanks!
> > 
> > On 10/27/05, Youngho Cho <youngho@nannet.co.kr> wrote:
> > > Hello,
> > >
> > > Ok , I've attached my test code for Korean which is slitely modified Koji's
code.
> > >
> > > Just put into the lia.analysis.i18n package at LuceneInAction
> > > and run ant.
> > >
> > > Hopely someone is helped.
> > >
> > > -------- build.xml  ---------
> > >
> > >   <target name="JapaneseDemo" depends="prepare"
> > >           description="Examples of Jananese analysis">
> > >     <info>
> > >
> > >       Japanese Test...
> > >
> > >     </info>
> > >
> > >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> > >   </target>
> > >
> > >   <target name="KoreanDemo" depends="prepare"
> > >           description="Examples of Korean analysis">
> > >     <info>
> > >
> > >       Korean Test...
> > >
> > >     </info>
> > >
> > >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> > >   </target>
> > >
> > >
> > > Thanks,
> > >
> > > Youngho
> > >
> > >
> > > ----- Original Message -----
> > > From: "Youngho Cho" <youngho@nannet.co.kr>
> > > To: <java-user@lucene.apache.org>; "Youngho Cho" <youngho@nannet.co.kr>
> > > Sent: Thursday, October 27, 2005 12:47 PM
> > > Subject: Re: korean and lucene
> > >
> > >
> > > > Hello all
> > > > Plese forgive me pervious my stupid message
> > > >
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = 경기
> > > >      [java] query = "경 기"
> > > >
> > > > I got the good result.
> > > >
> > > > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > > and all new 1.9 lucene. and build the test package.
> > > > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > > > I got the expected result !!!.
> > > >
> > > > I don't know the reason... ( looks like my finger make some trouble...
)
> > > >
> > > > Anyway thanks Koji and Cheolgoo
> > > > I will further test now...
> > > >
> > > > Youngho
> > > >
> > > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: "Youngho Cho" <youngho@nannet.co.kr>
> > > > To: <java-user@lucene.apache.org>
> > > > Sent: Thursday, October 27, 2005 12:28 PM
> > > > Subject: Re: korean and lucene
> > > >
> > > >
> > > > > Hello Koji
> > > > >
> > > > > Here is test result.
> > > > > Japanese is OK !.
> > > > > maybe ant clean  did some effect.
> > > > >
> > > > > Anyway please refer to the result using 1.9
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > >      [java] phrase = ラ?メン屋
> > > > >      [java] query = content:ラ?メン屋
> > > > >
> > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > >      [java] phrase = 경
> > > > >      [java] query =
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > >      [java] phrase = ラ?メン屋
> > > > >      [java] query = content:ラ?メン屋
> > > > >
> > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > >      [java] phrase = 경
> > > > >      [java] query = 경
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > >      [java] phrase = 경기
> > > > >      [java] query =
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > >      [java] phrase = 경기
> > > > >      [java] query = 경기
> > > > >
> > > > >
> > > > > Standard analyzer didn't tokenized the Korean Character at all....
> > > > >
> > > > > Ug....  look like
> > > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > > >  didn't effect at all for Korean.
> > > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > Youngho
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Koji Sekiguchi" <koji.sekiguchi@m4.dion.ne.jp>
> > > > > To: <java-user@lucene.apache.org>; "Youngho Cho" <youngho@nannet.co.kr>
> > > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > > Subject: RE: korean and lucene
> > > > >
> > > > >
> > > > > > Hello Youngho,
> > > > > >
> > > > > > I don't understand why you couldn't get hits result in Japanese,
> > > > > > though, you had better check why the query was empty with Korean
data:
> > > > > >
> > > > > > > For Korean
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query =
> > > > > >
> > > > > > The last line should be query = 경
> > > > > > to get hits result. Can you check why StandardAnalyzer
> > > > > > removes "경" during tokenizing?
> > > > > >
> > > > > > Koji
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > > To: java-user@lucene.apache.org
> > > > > > > Subject: Re: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > Hello Koji,
> > > > > > >
> > > > > > > Thanks for your kind reply.
> > > > > > >
> > > > > > > Yes, I used QueryParser. normaly I used
> > > > > > > Query = QueryParser.parse( ) method.
> > > > > > >
> > > > > > > I put your sample code into lia.analysis.i18n package in
LuceneAction
> > > > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > > > >
> > > > > > > results are
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > >      [java] query = content:ラ?メン屋
> > > > > > >
> > > > > > > I can't get hits result.
> > > > > > >
> > > > > > > For Korean
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query =
> > > > > > >
> > > > > > > I can't get query parse result.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Youngho
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > From: "Koji Sekiguchi" <koji.sekiguchi@m4.dion.ne.jp>
> > > > > > > To: <java-user@lucene.apache.org>; "Youngho Cho"
<youngho@nannet.co.kr>
> > > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > > Subject: RE: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > > Hi Youngho,
> > > > > > > >
> > > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > > I can search a word/phase.
> > > > > > > >
> > > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > > CJK characters into a stream of single character.
> > > > > > > > Use QueryParser to get a PhraseQuery and search the
query.
> > > > > > > >
> > > > > > > > Please see the following sample code. Replace Japanese
> > > > > > > > "contents" and (search target) "phrase" with Korean
in the
> > > > > > > program and run.
> > > > > > > >
> > > > > > > > regards,
> > > > > > > >
> > > > > > > > Koji
> > > > > > > >
> > > > > > > > =============================================
> > > > > > > > import java.io.IOException;
> > > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > > import org.apache.lucene.document.Document;
> > > > > > > > import org.apache.lucene.document.Field;
> > > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > > import org.apache.lucene.search.Query;
> > > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > > >
> > > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > > >
> > > > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > > > >     private static final String[] contents = {
> > > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > > >     };
> > > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > > >     //private static final String phrase = "屋";
> > > > > > > >     private static Analyzer analyzer = null;
> > > > > > > >
> > > > > > > >     public static void main( String[] args ) throws
> > > > > > > IOException, ParseException {
> > > > > > > > Directory directory = makeIndex();
> > > > > > > > search( directory );
> > > > > > > > directory.close();
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > > if( analyzer == null ){
> > > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > > }
> > > > > > > > return analyzer;
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     private static Directory makeIndex() throws IOException
{
> > > > > > > > Directory directory = new RAMDirectory();
> > > > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(),
true );
> > > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > > >     Document doc = new Document();
> > > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > > >     writer.addDocument( doc );
> > > > > > > > }
> > > > > > > > writer.close();
> > > > > > > > return directory;
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     private static void search( Directory directory
) throws
> > > > > > > IOException, ParseException {
> > > > > > > > IndexSearcher searcher = new IndexSearcher( directory
);
> > > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT,
getAnalyzer() );
> > > > > > > > Query query = parser.parse( phrase );
> > > > > > > > System.out.println( "query = " + query );
> > > > > > > > Hits hits = searcher.search( query );
> > > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > > >     System.out.println( "doc = " + hits.doc( i ).get(
FIELD_CONTENT ) );
> > > > > > > > searcher.close();
> > > > > > > >     }
> > > > > > > > }
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hello Cheolgoo,
> > > > > > > > >
> > > > > > > > > Now I updated my lucene version to 1.9 for using
StandardAnalyzer
> > > > > > > > > for Korean.
> > > > > > > > > And tested your patch which is already adopted
in 1.9
> > > > > > > > >
> > > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > > >
> > > > > > > > > But Still I have no good  results with Korean
compare with
> > > > > > > CJKAnalyzer.
> > > > > > > > >
> > > > > > > > > Single character is good match but more two character
word
> > > > > > > > > doesn't match at all.
> > > > > > > > >
> > > > > > > > > Am I something missing or still there need some
more works ?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Youngho.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Cheolgoo Kang" <appler@gmail.com>
> > > > > > > > > To: <java-user@lucene.apache.org>; "John
Wang" <john.wang@gmail.com>
> > > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj
cannot read
> > > > > > > > > > Korean part of Unicode character blocks.
> > > > > > > > > >
> > > > > > > > > > You should 1) use CJKAnalyzer or 2) add
Korean character
> > > > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition
on the
> > > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > > >
> > > > > > > > > > Hope it helps.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On 10/4/05, John Wang <john.wang@gmail.com>
wrote:
> > > > > > > > > > > Hi:
> > > > > > > > > > >
> > > > > > > > > > > We are running into problems with searching
on korean
> > > > > > > > > documents. We are
> > > > > > > > > > > using the StandardAnalyzer and everything
works with Chinese
> > > > > > > > > and Japanese.
> > > > > > > > > > > Are there known problems with Korean
with Lucene?
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > >
> > > > > > > > > > > -John
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Cheolgoo
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> > >
> > 
> > 
> > --
> > Cheolgoo
Mime
View raw message