lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cheolgoo Kang <app...@gmail.com>
Subject Re: korean and lucene
Date Fri, 11 Nov 2005 16:41:48 GMT
Thanks Bialecki,

I'm trying to test your program, thanks a lot!

And also, can you give me the paper you've cited [1] and [2]? I've
googled(entire web and google scholar) about it but got nothing.

On 11/8/05, Andrzej Bialecki <ab@getopt.org> wrote:
> KwonNam Son wrote:
>
> >First of all, I really appreciate your work on Lucene for Korean words,
> >
> >But If we cannot support stem analyzer for Korean words, I think one
> >token for one Korean character is better.
> >
> >When we search a word, usually we use "검색" not "검색하다". ("하다" is like
> >"ed" of "searched").
> >If we cannot get any result from "검색", StandardAnalyzer has no meaning
> >to Korean, I may have to go back to use CJKAnalyzer.
> >
> >How about let the StandarAnalyzer be not changed, and add a new
> >Analyzer for Korea words?
> >
> >
>
> Hello,
>
> My knowledge of Korean is near absolute zero... however, your example
> above looks like a typical stemming process for any Western language.
> The stem is not necessarily a valid dictionary word, just something that
> uniquely "labels" a group of related words created from the same root -
> and the transformation from inflected words to a stem can be expressed
> as a series of "patch commands" (insert/remove substring).
>
> I successfully used a Java package, originally created by Leon Galambos
> from Egothor project, to create an algorithmic stemmer for Polish
> (http://www.getopt.org/stempel). The advantage of this particular
> approach is that you don't have to encode specific grammar rules in the
> stemmer, the stemmer learns rules by itself from a training corpus. Such
> training corpus consists of pairs of inflected and base forms, and the
> library automatically learns these "patch commands", i.e. instructions
> for inserting/removing parts of an inflected word to arrive at the base
> form. This training process results in creating a stemmer table,
> reusable even for previously unseen words (based on the similarity of
> character patterns in input words).
>
> I suggest to try the code from the link above and test how it works,
> even if you only have a moderately-sized training corpus (~500 pairs)
> the results should be positive.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


--
Cheolgoo
Mime
View raw message