lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marios Skounakis <msc...@exis.com.gr>
Subject Stemmer Implementation Strategy - feedback?
Date Fri, 04 Aug 2006 17:29:53 GMT



Hi all,

The contrib section of Lucene contains a Greek Analyzer, which however only does
some letter normalization (capitals to lowercase, accent removal) and basic stop
word removal. 

I am interested in creating a Stemmer for the Greek Language to use with Lucene
(i.e. implement it as an analyzer). The Greek Language is quite different from
English (and most latin-related languages) in that it is highly inflectional(?) -
meaning that there is a large number of suffixes, many of which are not produced
in a very straightforward way.

A quick internet search did not return much information - a couple of
non-publicly available papers and a Master's Thesis with a javascript
implementation which, however, seems to be somewhat lacking in precision (i.e.
produces erroneous stems). A disappointing picture, admittedly, when for the
english language it is so easy to find a public domain high quality stemmer like
Porter's...

Anyway, to cut a long story short, I had the following idea in order to counter
the problem of the multiple suffixes and the high inflectional nature of the
language: implement the stemmer using a combination of a lexicon of stems (of the
most common words) and a list of all possible suffixes. The algorithm for finding
the stem of a word would be something like:

- for each suffix in the list of suffixes
   - remove the suffix from the word (if possible), producing a candindate stem
   - search the lexicon of stems for the candidate stem
       - if the search is succesful return the candidate stem
       - if the search is unsuccesful go to next suffix
- if the suffixes are exhausted and no match is found, the word cannot be stemmed
(return the original word)

[By the way the algorithm is inspired by a paper which descripbes the
implementation of a lemmatizer in a similar way - citeseer link:
http://citeseer.ist.psu.edu/694579.html]

The question is: 

Is such a strategy that depends on a leixcon of predefined stems for implementing
a stemmer considered a major drawback? In theory it can be (compared to an
algorith that works purely with rules, like Porter's) a drawback, but in
practice, with a lexicon of a few thousand stems, the stemmer could achieve
pretty good recall (and good precision too). 

Other issues to comment on are the lexicon size (which will have to be embedded
in or accompany the stemmer), memory issues in running the stemmer (keep the
lexicon in memory?), and performance issues (multiple lookups in the lexicon
could make it much slower than a rule based stemmer?). In general, any feedback
would be appreciated.

Thanks in advance,

Marios Skounakis
---- Msg sent via eXis webmail

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message