lucene-java-user mailing list archives

From Andrzej Bialecki
Subject Re: korean and lucene
Date Tue, 08 Nov 2005 10:30:38 GMT
KwonNam Son wrote:

>First of all, I really appreciate your work on Lucene for Korean words,
>but if we cannot support a stemming analyzer for Korean words, I think one
>token per Korean character is better.
>When we search for a word, we usually use "검색", not "검색하다". ("하다" is like
>the "ed" of "searched".)
>If we cannot get any results for "검색", StandardAnalyzer has no meaning
>for Korean, and I may have to go back to using CJKAnalyzer.
>How about leaving StandardAnalyzer unchanged and adding a new
>Analyzer for Korean words?


My knowledge of Korean is near absolute zero... however, your example 
above looks like a typical stemming process for any Western language. 
The stem is not necessarily a valid dictionary word, just something that 
uniquely "labels" a group of related words created from the same root - 
and the transformation from inflected words to a stem can be expressed 
as a series of "patch commands" (insert/remove substring).
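
As a rough illustration of what such a patch command could look like, here 
is a minimal Java sketch. The Patch class and its "strip N trailing 
characters, then append a suffix" form are assumptions made for this 
example only; they are not the command format used by the Egothor code. 
The Korean word is written with Unicode escapes so the file compiles 
regardless of source encoding.

```java
final class PatchCommandSketch {

    /** Hypothetical patch command: drop `strip` trailing chars, then append `append`. */
    static final class Patch {
        final int strip;
        final String append;

        Patch(int strip, String append) {
            this.strip = strip;
            this.append = append;
        }

        /** Applies the patch; returns the word unchanged if it is too short. */
        String apply(String word) {
            if (strip > word.length()) {
                return word;
            }
            return word.substring(0, word.length() - strip) + append;
        }
    }

    public static void main(String[] args) {
        Patch dropTwo = new Patch(2, "");   // "remove the last two characters"
        Patch iesToY = new Patch(3, "y");   // "remove 'ies', insert 'y'"

        System.out.println(dropTwo.apply("searched"));
        System.out.println(iesToY.apply("studies"));
        // Escaped Hangul for the inflected form quoted above (geomsaek-hada).
        System.out.println(dropTwo.apply("\uAC80\uC0C9\uD558\uB2E4"));
    }
}
```

The same "drop two characters" command maps "searched" to "search" and 
"검색하다" to "검색", which is the sense in which the Korean example 
resembles ordinary suffix stemming.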

I successfully used a Java package, originally created by Leon Galambos 
from the Egothor project, to create an algorithmic stemmer for Polish. 
The advantage of this particular 
approach is that you don't have to encode specific grammar rules in the 
stemmer; it learns the rules by itself from a training corpus. Such a 
training corpus consists of pairs of inflected and base forms, and the 
library automatically learns these "patch commands", i.e. instructions 
for inserting/removing parts of an inflected word to arrive at the base 
form. This training process produces a stemmer table that can be 
reused even for previously unseen words (based on the similarity of 
character patterns in input words).
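
To make the idea of learning from pairs concrete, here is a deliberately 
naive Java sketch. It is not the Egothor algorithm: the class 
SuffixPatchLearner, its methods, and the longest-matching-suffix lookup are 
all assumptions chosen to keep the example short. It only shows how rules 
derived from (inflected, base) pairs can be reused on words that were not 
in the training data.

```java
import java.util.HashMap;
import java.util.Map;

final class SuffixPatchLearner {

    // Learned table: suffix of an inflected word -> replacement giving the base form.
    private final Map<String, String> table = new HashMap<>();

    /** Learns one rule from an (inflected, base) pair by stripping their common prefix. */
    void train(String inflected, String base) {
        int i = 0;
        while (i < inflected.length() && i < base.length()
                && inflected.charAt(i) == base.charAt(i)) {
            i++;
        }
        String suffix = inflected.substring(i);
        if (suffix.isEmpty()) {
            return; // nothing to remove, so there is no rule to learn from this pair
        }
        table.put(suffix, base.substring(i)); // a later pair overwrites an earlier one
    }

    /** Applies the rule with the longest matching suffix; unchanged if none matches. */
    String stem(String word) {
        for (int start = 0; start < word.length(); start++) {
            String replacement = table.get(word.substring(start));
            if (replacement != null) {
                return word.substring(0, start) + replacement;
            }
        }
        return word;
    }

    public static void main(String[] args) {
        SuffixPatchLearner stemmer = new SuffixPatchLearner();
        stemmer.train("searched", "search");
        stemmer.train("studies", "study");
        // Escaped Hangul: geomsaek-hada -> geomsaek, as in the quoted message.
        stemmer.train("\uAC80\uC0C9\uD558\uB2E4", "\uAC80\uC0C9");

        System.out.println(stemmer.stem("walked"));   // unseen word ending in "ed"
        System.out.println(stemmer.stem("parties"));  // unseen word ending in "ies"
        // Unseen Korean word ending in the same two syllables (gongbu-hada).
        System.out.println(stemmer.stem("\uACF5\uBD80\uD558\uB2E4"));
    }
}
```

A table this simple over-stems readily (for example, it would also cut 
"red" down to "r"), which is one reason a real implementation needs a 
richer table structure and a reasonably sized training corpus.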

I suggest trying that package and testing how it works. Even with only 
a moderately sized training corpus (~500 pairs), 
the results should be positive.

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
