lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki ...@getopt.org>
Subject Re: korean and lucene
Date Tue, 08 Nov 2005 10:30:38 GMT
KwonNam Son wrote:

>First of all, I really appreciate your work on Lucene for Korean words,
>
>But If we cannot support stem analyzer for Korean words, I think one
>token for one Korean character is better.
>
>When we search a word, usually we use "검색" not "검색하다". ("하다" is like
>"ed" of "searched").
>If we cannot get any result from "검색", StandardAnalyzer has no meaning
>to Korean, I may have to go back to use CJKAnalyzer.
>
>How about let the StandarAnalyzer be not changed, and add a new
>Analyzer for Korea words?
>  
>

Hello,

My knowledge of Korean is near absolute zero... however, your example 
above looks like a typical stemming process for any Western language. 
The stem is not necessarily a valid dictionary word, just something that 
uniquely "labels" a group of related words created from the same root - 
and the transformation from inflected words to a stem can be expressed 
as a series of "patch commands" (insert/remove substring).

I successfully used a Java package, originally created by Leon Galambos 
from Egothor project, to create an algorithmic stemmer for Polish 
(http://www.getopt.org/stempel). The advantage of this particular 
approach is that you don't have to encode specific grammar rules in the 
stemmer, the stemmer learns rules by itself from a training corpus. Such 
training corpus consists of pairs of inflected and base forms, and the 
library automatically learns these "patch commands", i.e. instructions 
for inserting/removing parts of an inflected word to arrive at the base 
form. This training process results in creating a stemmer table, 
reusable even for previously unseen words (based on the similarity of 
character patterns in input words).

I suggest to try the code from the link above and test how it works, 
even if you only have a moderately-sized training corpus (~500 pairs) 
the results should be positive.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message