lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From eks dev <eks...@yahoo.co.uk>
Subject Re: Stemmer Implementation Strategy - feedback?
Date Tue, 08 Aug 2006 10:12:30 GMT
I would suggest you to have a look at Egothor stemmer (http://www.egothor.org/book/bk01ch01s06.html),
can be trained rather easily (if your only use of "roots" is for searching)

I have only heard of it as a good thing, never tried it


On Aug 4, 2006, at 1:29 PM, Marios Skounakis wrote:

>
>
>
> Hi all,
>
> The contrib section of Lucene contains a Greek Analyzer, which  
> however only does
> some letter normalization (capitals to lowercase, accent removal)  
> and basic stop
> word removal.
>
> I am interested in creating a Stemmer for the Greek Language to use  
> with Lucene
> (i.e. implement it as an analyzer). The Greek Language is quite  
> different from
> English (and most latin-related languages) in that it is highly  
> inflectional(?) -
> meaning that there is a large number of suffixes, many of which are  
> not produced
> in a very straightforward way.
>
> A quick internet search did not return much information - a couple of
> non-publicly available papers and a Master's Thesis with a javascript
> implementation which, however, seems to be somewhat lacking in  
> precision (i.e.
> produces erroneous stems). A disappointing picture, admittedly,  
> when for the
> english language it is so easy to find a public domain high quality  
> stemmer like
> Porter's...
>
> Anyway, to cut a long story short, I had the following idea in  
> order to counter
> the problem of the multiple suffixes and the high inflectional  
> nature of the
> language: implement the stemmer using a combination of a lexicon of  
> stems (of the
> most common words) and a list of all possible suffixes. The  
> algorithm for finding
> the stem of a word would be something like:
>
> - for each suffix in the list of suffixes
>    - remove the suffix from the word (if possible), producing a  
> candindate stem
>    - search the lexicon of stems for the candidate stem
>        - if the search is succesful return the candidate stem
>        - if the search is unsuccesful go to next suffix
> - if the suffixes are exhausted and no match is found, the word  
> cannot be stemmed
> (return the original word)
>
> [By the way the algorithm is inspired by a paper which descripbes the
> implementation of a lemmatizer in a similar way - citeseer link:
> http://citeseer.ist.psu.edu/694579.html]
>
> The question is:
>
> Is such a strategy that depends on a leixcon of predefined stems  
> for implementing
> a stemmer considered a major drawback? In theory it can be  
> (compared to an
> algorith that works purely with rules, like Porter's) a drawback,  
> but in
> practice, with a lexicon of a few thousand stems, the stemmer could  
> achieve
> pretty good recall (and good precision too).
>
> Other issues to comment on are the lexicon size (which will have to  
> be embedded
> in or accompany the stemmer), memory issues in running the stemmer  
> (keep the
> lexicon in memory?), and performance issues (multiple lookups in  
> the lexicon
> could make it much slower than a rule based stemmer?). In general,  
> any feedback
> would be appreciated.
>
> Thanks in advance,
>
> Marios Skounakis
> ---- Msg sent via eXis webmail
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message