lucene-java-user mailing list archives

From Cheolgoo Kang <app...@gmail.com>
Subject Re: Getting Started with Korean
Date Sat, 12 Nov 2005 06:16:39 GMT
Hi,

On 11/11/05, Grant Ingersoll <gsingers@syr.edu> wrote:
> Hi,
>
> Was wondering if someone could help me out with a few things in Korean
> as related to Lucene:
> 1.  Which Analyzer do you recommend?  From the list, I see that some
> have had success with the StandardAnalyzer.  Are there any caveats I
> should be aware of if I choose to use it?

The StandardAnalyzer currently in svn splits every Korean word into
individual characters. As you know, a single Korean character carries
almost no meaning on its own, so I've submitted a patch on JIRA to
address this issue. You can find it at
http://issues.apache.org/jira/browse/LUCENE-461. As for stemming,
StandardTokenizer (and therefore StandardAnalyzer) cannot do it, so you
need something else, such as CJKAnalyzer, which does bi-gram
tokenization. There is currently no freely available Lucene analyzer
that does Korean stemming the way Porter, Lovins, etc. do.
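
To make the bi-gram idea concrete, here is a minimal sketch in plain
Java. It is not CJKAnalyzer's actual code and uses no Lucene classes;
the class name BigramSketch and the method bigrams are made up for
illustration, and it assumes the input contains only Hangul words
separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of Lucene.
public class BigramSketch {

    // Splits text on whitespace, then emits overlapping two-character
    // tokens for each word. A one-character word is kept as-is.
    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            // Work on code points so each Hangul syllable counts as one unit.
            int[] cps = word.codePoints().toArray();
            if (cps.length == 1) {
                tokens.add(word);
                continue;
            }
            for (int i = 0; i + 1 < cps.length; i++) {
                tokens.add(new String(cps, i, 2));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [한국, 국어, 분석]
        System.out.println(bigrams("한국어 분석"));
    }
}
```

Compared with one token per character, the overlapping pairs keep some
context, so a query for "한국" matches documents containing "한국어"
without any stemming.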

> 2.  Could anyone point me to a fairly decent size (doesn't need to be
> huge), freely available collection?

Please check out the Sejong project (http://www.sejong.or.kr/; Sejong
is the name of the king who created Hangul in the 15th century). It's
a kind of national linguistics project and has a lot of Korean corpora
that are freely available for research purposes only. But the texts are
provided in the xxx.HWP file format, so they are hard to download and
use in one shot. It's very, very time-consuming :-( You need the
"Hangul 2005" word processor to read the xxx.HWP files. (I know the
Sejong project shouldn't have used a proprietary company format like
HWP instead of XML or even just TXT.)

>
> Thanks,
> Grant
>
> --
> -------------------------------------------------------------------
> Grant Ingersoll
> Sr. Software Engineer
> Center for Natural Language Processing
> Syracuse University
> School of Information Studies
> 337 Hinds Hall
> Syracuse, NY 13244
>
> http://www.cnlp.org
> Voice:  315-443-5484
> Fax: 315-443-6886
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


--
Cheolgoo
