lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@syr.edu>
Subject Re: Stemmer Implementation Strategy - feedback?
Date Mon, 07 Aug 2006 11:34:15 GMT
Hey Marios,

It sounds like you have a reasonable plan and you have thought  
through the ideas.  And the answer to many of your questions below is  
"it depends".

Do you have enough memory to hold the whole lexicon in memory?  Is  
this lexicon going to grow significantly over time? I have, in the  
past (for other lexicon based resources), done similar tricks to what  
Lucene does with the term dictionary if things get too large.   
Namely, store the terms in lexicographic order and load every X  
number of terms in memory (X is 128 or 64) as an index into the  
file.  This causes a little more searching than if everything is in  
memory, but you don't have a choice if the lexicon is really large.

Is your lexicon approach going to be complete?  I don't know Greek,  
so I don't know if you have a fixed set of roots.  Also, I don't know  
if Greek has a notion of light stemming versus more aggressive  
stemming.  In Arabic, we found we had better results by doing light  
stemming (as have other researchers.)

It appears to me, based on your research, you don't have much other  
choice.  As for performance, you approach is much faster than your  
alternative :-) (i.e. doing it by hand.)  Writing the stemmer seems  
pretty easy, so I would go for it and then test it to see if it meets  
your needs and, then, if you can, share it with others here.

-Grant

On Aug 4, 2006, at 1:29 PM, Marios Skounakis wrote:

>
>
>
> Hi all,
>
> The contrib section of Lucene contains a Greek Analyzer, which  
> however only does
> some letter normalization (capitals to lowercase, accent removal)  
> and basic stop
> word removal.
>
> I am interested in creating a Stemmer for the Greek Language to use  
> with Lucene
> (i.e. implement it as an analyzer). The Greek Language is quite  
> different from
> English (and most latin-related languages) in that it is highly  
> inflectional(?) -
> meaning that there is a large number of suffixes, many of which are  
> not produced
> in a very straightforward way.
>
> A quick internet search did not return much information - a couple of
> non-publicly available papers and a Master's Thesis with a javascript
> implementation which, however, seems to be somewhat lacking in  
> precision (i.e.
> produces erroneous stems). A disappointing picture, admittedly,  
> when for the
> english language it is so easy to find a public domain high quality  
> stemmer like
> Porter's...
>
> Anyway, to cut a long story short, I had the following idea in  
> order to counter
> the problem of the multiple suffixes and the high inflectional  
> nature of the
> language: implement the stemmer using a combination of a lexicon of  
> stems (of the
> most common words) and a list of all possible suffixes. The  
> algorithm for finding
> the stem of a word would be something like:
>
> - for each suffix in the list of suffixes
>    - remove the suffix from the word (if possible), producing a  
> candindate stem
>    - search the lexicon of stems for the candidate stem
>        - if the search is succesful return the candidate stem
>        - if the search is unsuccesful go to next suffix
> - if the suffixes are exhausted and no match is found, the word  
> cannot be stemmed
> (return the original word)
>
> [By the way the algorithm is inspired by a paper which descripbes the
> implementation of a lemmatizer in a similar way - citeseer link:
> http://citeseer.ist.psu.edu/694579.html]
>
> The question is:
>
> Is such a strategy that depends on a leixcon of predefined stems  
> for implementing
> a stemmer considered a major drawback? In theory it can be  
> (compared to an
> algorith that works purely with rules, like Porter's) a drawback,  
> but in
> practice, with a lexicon of a few thousand stems, the stemmer could  
> achieve
> pretty good recall (and good precision too).
>
> Other issues to comment on are the lexicon size (which will have to  
> be embedded
> in or accompany the stemmer), memory issues in running the stemmer  
> (keep the
> lexicon in memory?), and performance issues (multiple lookups in  
> the lexicon
> could make it much slower than a rule based stemmer?). In general,  
> any feedback
> would be appreciated.
>
> Thanks in advance,
>
> Marios Skounakis
> ---- Msg sent via eXis webmail
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message