lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cheolgoo Kang <app...@gmail.com>
Subject Re: korean and lucene
Date Tue, 08 Nov 2005 08:53:10 GMT
On 11/8/05, Youngho Cho <youngho@nannet.co.kr> wrote:
> Hello,
>
> just simple test ...
> If I compile the javacc correctly..
> the patched version doesn't match some situation
> for example
> in text
> '엔진박지성(맨체스터 유나이티드)이 주말 프리미어리그를 위해
벤치를 지키며 재충전의 시간을 가졌다.'
> if query word is '시간'   than nothing match
> but if query word is '시간을'  than good match.

That's exactly what I've wanted to do with this patch. You need more
sophisticated morphological analyzer to to what you've wanted to.
AFAIK, I'm afraid there is no open sourced software to do the Korean
morphological analysis.

And also, if you have Lucene in Action Korean translation, there is a
simple introduction to Korean stemming analysis at Appendix D. That's
just simple enough to hold a list of Korean word endings, and check
for each Token with matching word endings. :)

>
> I think there is some tradeoff here.
>
> Maybe need some good stop filter for korean etc.......
>
>
> Thanks
>
> Youngho
>
> ----- Original Message -----
> From: "Youngho Cho" <youngho@nannet.co.kr>
> To: <java-user@lucene.apache.org>
> Sent: Tuesday, November 08, 2005 4:44 PM
> Subject: Re: korean and lucene
>
>
> > Hello Cheolgoo,
> >
> > I will test the patch.
> >
> >
> > Thanks,
> >
> > Youngho
> >
> > ----- Original Message -----
> > From: "Cheolgoo Kang" <appler@gmail.com>
> > To: <java-user@lucene.apache.org>; "Youngho Cho" <youngho@nannet.co.kr>
> > Sent: Tuesday, November 08, 2005 4:06 PM
> > Subject: Re: korean and lucene
> >
> >
> > > Hello,
> > >
> > > I've created a new JIRA issue with Korean analysis that
> > > StandardAnalyzer splits one word into several tokens each with one
> > > character. Cause Korean is not a phonogram, one character in Korean
> > > has almost no meaning at all. So word in Korean should be preserved
> > > not like Chinese or Japanese.
> > >
> > > I've attached a patch to StandardTokenizer to do this and passed the
> > > test case TestStandardAnalyzer(also patched to test Korean words).
> > >
> > > Thanks!
> > >
> > > On 10/27/05, Youngho Cho <youngho@nannet.co.kr> wrote:
> > > > Hello,
> > > >
> > > > Ok , I've attached my test code for Korean which is slitely modified Koji's
code.
> > > >
> > > > Just put into the lia.analysis.i18n package at LuceneInAction
> > > > and run ant.
> > > >
> > > > Hopely someone is helped.
> > > >
> > > > -------- build.xml  ---------
> > > >
> > > >   <target name="JapaneseDemo" depends="prepare"
> > > >           description="Examples of Jananese analysis">
> > > >     <info>
> > > >
> > > >       Japanese Test...
> > > >
> > > >     </info>
> > > >
> > > >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> > > >   </target>
> > > >
> > > >   <target name="KoreanDemo" depends="prepare"
> > > >           description="Examples of Korean analysis">
> > > >     <info>
> > > >
> > > >       Korean Test...
> > > >
> > > >     </info>
> > > >
> > > >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> > > >   </target>
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Youngho
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: "Youngho Cho" <youngho@nannet.co.kr>
> > > > To: <java-user@lucene.apache.org>; "Youngho Cho" <youngho@nannet.co.kr>
> > > > Sent: Thursday, October 27, 2005 12:47 PM
> > > > Subject: Re: korean and lucene
> > > >
> > > >
> > > > > Hello all
> > > > > Plese forgive me pervious my stupid message
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > >      [java] phrase = 경기
> > > > >      [java] query = "경 기"
> > > > >
> > > > > I got the good result.
> > > > >
> > > > > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > > > and all new 1.9 lucene. and build the test package.
> > > > > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > > > > I got the expected result !!!.
> > > > >
> > > > > I don't know the reason... ( looks like my finger make some trouble...
)
> > > > >
> > > > > Anyway thanks Koji and Cheolgoo
> > > > > I will further test now...
> > > > >
> > > > > Youngho
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Youngho Cho" <youngho@nannet.co.kr>
> > > > > To: <java-user@lucene.apache.org>
> > > > > Sent: Thursday, October 27, 2005 12:28 PM
> > > > > Subject: Re: korean and lucene
> > > > >
> > > > >
> > > > > > Hello Koji
> > > > > >
> > > > > > Here is test result.
> > > > > > Japanese is OK !.
> > > > > > maybe ant clean  did some effect.
> > > > > >
> > > > > > Anyway please refer to the result using 1.9
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > >      [java] phrase = ラ?メン屋
> > > > > >      [java] query = content:ラ?メン屋
> > > > > >
> > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > >      [java] phrase = 경
> > > > > >      [java] query =
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > >      [java] phrase = ラ?メン屋
> > > > > >      [java] query = content:ラ?メン屋
> > > > > >
> > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > >      [java] phrase = 경
> > > > > >      [java] query = 경
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > >      [java] phrase = 경기
> > > > > >      [java] query =
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > >      [java] phrase = 경기
> > > > > >      [java] query = 경기
> > > > > >
> > > > > >
> > > > > > Standard analyzer didn't tokenized the Korean Character at all....
> > > > > >
> > > > > > Ug....  look like
> > > > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > > > >  didn't effect at all for Korean.
> > > > > >
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > Youngho
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Koji Sekiguchi" <koji.sekiguchi@m4.dion.ne.jp>
> > > > > > To: <java-user@lucene.apache.org>; "Youngho Cho" <youngho@nannet.co.kr>
> > > > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > > > Subject: RE: korean and lucene
> > > > > >
> > > > > >
> > > > > > > Hello Youngho,
> > > > > > >
> > > > > > > I don't understand why you couldn't get hits result in
Japanese,
> > > > > > > though, you had better check why the query was empty with
Korean data:
> > > > > > >
> > > > > > > > For Korean
> > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > >      [java] phrase = 경
> > > > > > > >      [java] query =
> > > > > > >
> > > > > > > The last line should be query = 경
> > > > > > > to get hits result. Can you check why StandardAnalyzer
> > > > > > > removes "경" during tokenizing?
> > > > > > >
> > > > > > > Koji
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > Subject: Re: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > Hello Koji,
> > > > > > > >
> > > > > > > > Thanks for your kind reply.
> > > > > > > >
> > > > > > > > Yes, I used QueryParser. normaly I used
> > > > > > > > Query = QueryParser.parse( ) method.
> > > > > > > >
> > > > > > > > I put your sample code into lia.analysis.i18n package
in LuceneAction
> > > > > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > > > > >
> > > > > > > > results are
> > > > > > > >
> > > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > > >      [java] query = content:ラ?メン屋
> > > > > > > >
> > > > > > > > I can't get hits result.
> > > > > > > >
> > > > > > > > For Korean
> > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > >      [java] phrase = 경
> > > > > > > >      [java] query =
> > > > > > > >
> > > > > > > > I can't get query parse result.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Youngho
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: "Koji Sekiguchi" <koji.sekiguchi@m4.dion.ne.jp>
> > > > > > > > To: <java-user@lucene.apache.org>; "Youngho
Cho" <youngho@nannet.co.kr>
> > > > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > > > Subject: RE: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > > Hi Youngho,
> > > > > > > > >
> > > > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > > > I can search a word/phase.
> > > > > > > > >
> > > > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > > > CJK characters into a stream of single character.
> > > > > > > > > Use QueryParser to get a PhraseQuery and search
the query.
> > > > > > > > >
> > > > > > > > > Please see the following sample code. Replace
Japanese
> > > > > > > > > "contents" and (search target) "phrase" with
Korean in the
> > > > > > > > program and run.
> > > > > > > > >
> > > > > > > > > regards,
> > > > > > > > >
> > > > > > > > > Koji
> > > > > > > > >
> > > > > > > > > =============================================
> > > > > > > > > import java.io.IOException;
> > > > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > > > import org.apache.lucene.document.Document;
> > > > > > > > > import org.apache.lucene.document.Field;
> > > > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > > > import org.apache.lucene.search.Query;
> > > > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > > > >
> > > > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > > > >
> > > > > > > > >     private static final String FIELD_CONTENT
= "content";
> > > > > > > > >     private static final String[] contents =
{
> > > > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > > > >     };
> > > > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > > > >     //private static final String phrase = "屋";
> > > > > > > > >     private static Analyzer analyzer = null;
> > > > > > > > >
> > > > > > > > >     public static void main( String[] args )
throws
> > > > > > > > IOException, ParseException {
> > > > > > > > > Directory directory = makeIndex();
> > > > > > > > > search( directory );
> > > > > > > > > directory.close();
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > > > if( analyzer == null ){
> > > > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > > > }
> > > > > > > > > return analyzer;
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > >     private static Directory makeIndex() throws
IOException {
> > > > > > > > > Directory directory = new RAMDirectory();
> > > > > > > > > IndexWriter writer = new IndexWriter( directory,
getAnalyzer(), true );
> > > > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > > > >     Document doc = new Document();
> > > > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > > > >     writer.addDocument( doc );
> > > > > > > > > }
> > > > > > > > > writer.close();
> > > > > > > > > return directory;
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > >     private static void search( Directory directory
) throws
> > > > > > > > IOException, ParseException {
> > > > > > > > > IndexSearcher searcher = new IndexSearcher( directory
);
> > > > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT,
getAnalyzer() );
> > > > > > > > > Query query = parser.parse( phrase );
> > > > > > > > > System.out.println( "query = " + query );
> > > > > > > > > Hits hits = searcher.search( query );
> > > > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > > > >     System.out.println( "doc = " + hits.doc(
i ).get( FIELD_CONTENT ) );
> > > > > > > > > searcher.close();
> > > > > > > > >     }
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > > > To: java-user@lucene.apache.org; Cheolgoo
Kang
> > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hello Cheolgoo,
> > > > > > > > > >
> > > > > > > > > > Now I updated my lucene version to 1.9 for
using StandardAnalyzer
> > > > > > > > > > for Korean.
> > > > > > > > > > And tested your patch which is already adopted
in 1.9
> > > > > > > > > >
> > > > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > > > >
> > > > > > > > > > But Still I have no good  results with Korean
compare with
> > > > > > > > CJKAnalyzer.
> > > > > > > > > >
> > > > > > > > > > Single character is good match but more
two character word
> > > > > > > > > > doesn't match at all.
> > > > > > > > > >
> > > > > > > > > > Am I something missing or still there need
some more works ?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Youngho.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ----- Original Message -----
> > > > > > > > > > From: "Cheolgoo Kang" <appler@gmail.com>
> > > > > > > > > > To: <java-user@lucene.apache.org>;
"John Wang" <john.wang@gmail.com>
> > > > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj
cannot read
> > > > > > > > > > > Korean part of Unicode character blocks.
> > > > > > > > > > >
> > > > > > > > > > > You should 1) use CJKAnalyzer or 2)
add Korean character
> > > > > > > > > > > block(0xAC00~0xD7AF) to the CJK token
definition on the
> > > > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > > > >
> > > > > > > > > > > Hope it helps.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On 10/4/05, John Wang <john.wang@gmail.com>
wrote:
> > > > > > > > > > > > Hi:
> > > > > > > > > > > >
> > > > > > > > > > > > We are running into problems with
searching on korean
> > > > > > > > > > documents. We are
> > > > > > > > > > > > using the StandardAnalyzer and
everything works with Chinese
> > > > > > > > > > and Japanese.
> > > > > > > > > > > > Are there known problems with
Korean with Lucene?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > >
> > > > > > > > > > > > -John
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Cheolgoo
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Cheolgoo


--
Cheolgoo
Mime
View raw message