lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marios Skounakis <msc...@exis.com.gr>
Subject Re: Stemmer Implementation Strategy - feedback?
Date Tue, 08 Aug 2006 06:23:23 GMT
Hi Grant,

Thanks for the interesting reply.

Grant Ingersoll wrote:
> Hey Marios,
>
> It sounds like you have a reasonable plan and you have thought through 
> the ideas.  And the answer to many of your questions below is "it 
> depends".
>
> Do you have enough memory to hold the whole lexicon in memory?  Is 
> this lexicon going to grow significantly over time? I have, in the 
> past (for other lexicon based resources), done similar tricks to what 
> Lucene does with the term dictionary if things get too large.  Namely, 
> store the terms in lexicographic order and load every X number of 
> terms in memory (X is 128 or 64) as an index into the file.  This 
> causes a little more searching than if everything is in memory, but 
> you don't have a choice if the lexicon is really large.
Well, for my own purposes I can probably arrange to have the whole 
lexicon in memory. However, I directed the question to the lucene list 
in order to find out what people think about the general case - is the 
stemmer class allowed to use, say, 1 MB of memory?
> Is your lexicon approach going to be complete?  I don't know Greek, so 
> I don't know if you have a fixed set of roots.  Also, I don't know if 
> Greek has a notion of light stemming versus more aggressive stemming.  
> In Arabic, we found we had better results by doing light stemming (as 
> have other researchers.)
I suspect a literature search will give me a lot of information on light 
versus aggressive stemming, but I'll go ahead and ask anyway :). Can you 
give an example (not in Arabic please, unless you also provide the 
English translations) of a case where light stemming is preferable to 
more aggressive stemming? This direction sounds interesting because it 
may be pretty easy to implement rule-based light stemming for Greek. I 
had actually considered it as a solution and estimated that it would 
give you 80-90% of what a full stemmer would give you - but I had never 
thought it could actually give you *better* results, like you say.
>
> It appears to me, based on your research, you don't have much other 
> choice.  As for performance, you approach is much faster than your 
> alternative :-) (i.e. doing it by hand.)  Writing the stemmer seems 
> pretty easy, so I would go for it and then test it to see if it meets 
> your needs and, then, if you can, share it with others here.
>
> -Grant
>
> On Aug 4, 2006, at 1:29 PM, Marios Skounakis wrote:
>
>>
>>
>>
>> Hi all,
>>
>> The contrib section of Lucene contains a Greek Analyzer, which 
>> however only does
>> some letter normalization (capitals to lowercase, accent removal) and 
>> basic stop
>> word removal.
>>
>> I am interested in creating a Stemmer for the Greek Language to use 
>> with Lucene
>> (i.e. implement it as an analyzer). The Greek Language is quite 
>> different from
>> English (and most latin-related languages) in that it is highly 
>> inflectional(?) -
>> meaning that there is a large number of suffixes, many of which are 
>> not produced
>> in a very straightforward way.
>>
>> A quick internet search did not return much information - a couple of
>> non-publicly available papers and a Master's Thesis with a javascript
>> implementation which, however, seems to be somewhat lacking in 
>> precision (i.e.
>> produces erroneous stems). A disappointing picture, admittedly, when 
>> for the
>> english language it is so easy to find a public domain high quality 
>> stemmer like
>> Porter's...
>>
>> Anyway, to cut a long story short, I had the following idea in order 
>> to counter
>> the problem of the multiple suffixes and the high inflectional nature 
>> of the
>> language: implement the stemmer using a combination of a lexicon of 
>> stems (of the
>> most common words) and a list of all possible suffixes. The algorithm 
>> for finding
>> the stem of a word would be something like:
>>
>> - for each suffix in the list of suffixes
>>    - remove the suffix from the word (if possible), producing a 
>> candindate stem
>>    - search the lexicon of stems for the candidate stem
>>        - if the search is succesful return the candidate stem
>>        - if the search is unsuccesful go to next suffix
>> - if the suffixes are exhausted and no match is found, the word 
>> cannot be stemmed
>> (return the original word)
>>
>> [By the way the algorithm is inspired by a paper which descripbes the
>> implementation of a lemmatizer in a similar way - citeseer link:
>> http://citeseer.ist.psu.edu/694579.html]
>>
>> The question is:
>>
>> Is such a strategy that depends on a leixcon of predefined stems for 
>> implementing
>> a stemmer considered a major drawback? In theory it can be (compared 
>> to an
>> algorith that works purely with rules, like Porter's) a drawback, but in
>> practice, with a lexicon of a few thousand stems, the stemmer could 
>> achieve
>> pretty good recall (and good precision too).
>>
>> Other issues to comment on are the lexicon size (which will have to 
>> be embedded
>> in or accompany the stemmer), memory issues in running the stemmer 
>> (keep the
>> lexicon in memory?), and performance issues (multiple lookups in the 
>> lexicon
>> could make it much slower than a rule based stemmer?). In general, 
>> any feedback
>> would be appreciated.
>>
>> Thanks in advance,
>>
>> Marios Skounakis
>> ---- Msg sent via eXis webmail
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
> --------------------------
> Grant Ingersoll
> Sr. Software Engineer
> Center for Natural Language Processing
> Syracuse University
> 335 Hinds Hall
> Syracuse, NY 13244
> http://www.cnlp.org
>
> Voice: 315-443-5484
> Skype: grant_ingersoll
> Fax: 315-443-6886
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message