lucene-dev mailing list archives

From Leo Galambos <Le...@seznam.cz>
Subject Re: SnowballAnalyzer
Date Tue, 07 Oct 2003 23:13:47 GMT
Hi Pete,

IMHO you could also use stemmers which are 1) faster, 2) more accurate,
3) able to learn and process *any* language, and 4) able to work as a
lemmatiser/guesser. I know two algorithms which have all these properties:

The first one is based on Jan Daciuk's MFSA, and the second one is, ehm 
no self-promotion ;-), my method. The comparison of these two methods is 
here: http://www.egothor.org/temp/us-0E2-cmp.png (English dictionary)

My method was designed for IR systems, so it gives better accuracy in
such environments. I was also interested in compound words (->German),
so I can offer you a multilevel stemmer which does the job. Elsewhere
you may get better results with Jan's method.

Leo

Pete Lewis wrote:

>Hi all
>
>I know that I have no vote but I think that it would be wrong to bring the
>SnowballAnalyzer into the core.
>
>There are some distinct limitations with this purely algorithmic approach.
>Yes, it would be great to say 'hey, we have 14 languages covered', but you
>should first realise the limitations of the product. Let's start with some
>definitions....
>
>'Stemming' signifies the process of finding the stems in words.
>'Lemmatisation' is the process of reducing the word form to its 'lemma'
>form, i.e. the form one expects to find in a dictionary. The differences are:
>
>1.      In many languages the dictionary form is not the stem. E.g. in Dutch
>the infinitive of a verb is not its stem.
>
>2.      Words may have several stems due to composition (common in Germanic
>languages).
>
>Both terms are used extremely loosely in the literature, where they often
>indicate the same thing.
>
>
>
>A tool often used for English is the Porter stemmer. Strictly speaking, it is
>neither a stemmer nor a lemmatiser; it cuts off certain characters on the
>basis of the characters before them. In many cases morphologically equivalent
>forms reduce to the same root form. There have been efforts to create similar
>algorithmic tools for other languages. Porter has lately designed a language
>called Snowball for writing scripts that perform these reductions. Snowball
>has been applied to a number of languages, and in many cases the scripts are
>publicly available. Snowball is not capable of handling composition. Nor is
>it capable of handling other, more demanding morphological patterns, such as
>agglutination and infixes.
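
To make "cuts off certain characters on the basis of the characters before
them" concrete, here is a minimal sketch of rule-based suffix stripping. It is
not the Porter or Snowball algorithm; the rule table and the name
`strip_suffix` are made up for illustration, and a real stemmer has many more
rules and conditions.

```python
# Toy suffix stripper. NOT Porter/Snowball, only the general idea.
# Each rule is (suffix, replacement, minimum length of the remaining stem).
# Rules are tried in order and only the first applicable one is used.
RULES = [
    ("sses", "ss", 1),
    ("ies", "i", 1),
    ("ement", "", 3),
    ("ment", "", 3),
    ("ing", "", 3),
    ("ss", "ss", 0),   # keep a double s as it is
    ("us", "us", 0),   # keep words such as "bus" as they are
    ("s", "", 2),
    ("e", "", 4),
]


def strip_suffix(word):
    """Return the word with the first applicable suffix rule applied."""
    for suffix, replacement, min_stem in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if len(stem) >= min_stem:
                return stem + replacement
    return word


for w in ["bus", "buses", "catch", "caught", "manage", "management"]:
    print(w, "->", strip_suffix(w))
```

Even this toy version shows the same behaviour as the examples quoted in this
message: it has no knowledge of words, only of character patterns.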
>
>
>
>Basically, people would expect the terms in the search query to be reduced to
>the same root form as that used for indexing, and hence would be able to find
>the different derivations of the term (plurals etc).
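
The requirement above can be sketched with a tiny inverted index, assuming a
placeholder `normalize` function that stands in for a real stemmer. All names
here (`normalize`, `build_index`, `search`) are hypothetical and not part of
Lucene.

```python
from collections import defaultdict


def normalize(term):
    """Hypothetical stand-in for a stemmer: lowercase, drop one trailing 's'."""
    term = term.lower()
    if term.endswith("s") and len(term) > 3:
        return term[:-1]
    return term


def build_index(docs, analyze=normalize):
    """Map each analyzed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[analyze(token)].add(doc_id)
    return index


def search(index, query, analyze=normalize):
    """Look up a single-term query after analyzing it."""
    return index.get(analyze(query), set())


docs = {1: "Red cars", 2: "a car park"}
index = build_index(docs)

# Same analysis at index and query time: both documents match.
print(search(index, "Cars"))
# Different analysis at query time: the reduced index term is never hit.
print(search(index, "Cars", analyze=str.lower))
```

The point is only that whatever reduction is chosen must be applied
identically on both sides; otherwise the query term and the index term never
meet.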
>
>
>
>Some examples from Snowball should speak for themselves:
>
>
>
>bus -> bus
>buses -> buse
>catch -> catch
>caught -> caught
>manage -> manag
>management -> manag
>
>
>
>These show incorrect handling of plurals and irregular forms, and the mixing
>of verbs and nouns. Obviously many other examples can be found.
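
One common way to handle such cases is to consult a word list before falling
back to rules. The following is a minimal sketch of that idea only; the
`EXCEPTIONS` table, `crude_stem` and `reduce_term` are hypothetical names, and
this is not how KStem, Snowball, or any particular tool is implemented.

```python
# Hypothetical exception list; a real one would come from a dictionary.
EXCEPTIONS = {
    "buses": "bus",
    "caught": "catch",
}


def crude_stem(word):
    """Placeholder rule: drop a final 's' unless the word ends in 'ss' or 'us'."""
    if word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]
    return word


def reduce_term(word):
    """Dictionary lookup first, algorithmic fallback second."""
    w = word.lower()
    if w in EXCEPTIONS:
        return EXCEPTIONS[w]
    return crude_stem(w)


for w in ["bus", "buses", "catch", "caught"]:
    print(w, "->", reduce_term(w))
```

The trade-off is that the word list has to be built and maintained per
language, which is exactly what a purely algorithmic approach avoids.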
>
>
>
>While this isn't too bad for English it gets pretty dire for other languages.
>
>
>
>For English I'd prefer KStem rather than Snowball.
>
>
>
>Cheers
>
>
>
>Pete
>
>
>
>
>
>----- Original Message ----- 
>From: "Erik Hatcher" <erik@ehatchersolutions.com>
>To: "Lucene List" <lucene-dev@jakarta.apache.org>
>Sent: Monday, October 06, 2003 6:49 PM
>Subject: SnowballAnalyzer
>
>
>  
>
>>At one point, I believe, it was proposed to bring the sandbox 
>>SnowballAnalyzer into the core.  Is this still desired or shall we just 
>>leave it in the sandbox?
>>
>>Erik
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org




