lucene-java-user mailing list archives

From Matthew Hall <mh...@informatics.jax.org>
Subject Re: Stemming
Date Fri, 08 May 2009 12:57:59 GMT
Ganesh wrote:
> My opinion is Stemming process is to get the base word. Here it is not 
> doing so.
>
Unfortunately, this is where your problem lies. Stemming doesn't do this; 
it reduces words that are almost lexically equivalent to a common 
stem, which need not be a real word. Thus cat = cats.

From the wiki: "*Stemming* is the process for reducing inflected (or 
sometimes derived) words to their stem 
<http://en.wikipedia.org/wiki/Word_stem>, base or root 
<http://en.wikipedia.org/wiki/Root_%28linguistics%29> form – generally a 
written word form. The stem need not be identical to the morphological 
root <http://en.wikipedia.org/wiki/Morphological_root> of the word; it 
is usually sufficient that related words map to the same stem, even if 
this stem is not in itself a valid root. The algorithm 
<http://en.wikipedia.org/wiki/Algorithm> has been a long-standing 
problem in computer science 
<http://en.wikipedia.org/wiki/Computer_science>; the first paper on the 
subject was published in 1968. The process of stemming, often called 
*conflation <http://en.wikipedia.org/wiki/Conflation>*, is useful in 
search engines <http://en.wikipedia.org/wiki/Search_engine> for query 
expansion <http://en.wikipedia.org/wiki/Query_expansion> or indexing 
<http://en.wikipedia.org/wiki/Index_%28search_engine%29> and other 
natural language processing 
<http://en.wikipedia.org/wiki/Natural_language_processing> problems."

But the words "hard" and "harder" mean different things (in the opinion 
of those who developed the Snowball algorithm), and as such shouldn't be 
stemmed down to a single word.

Now, I find it arguable whether hard and harder really aren't close 
enough to stem to the same root, but to get that effect you will need 
to either change the Snowball algorithm or reduce your words to a more 
basic form before they go into the stemmer, which is a hairy road 
indeed ^^
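
To give a rough idea of the second option, here is a minimal sketch of a 
pre-processing step that maps selected words to a base form before the 
stemmer sees them. Everything in it is an assumption made for 
illustration: the class name PreStemNormalizer, the tiny hand-made 
exception list, and the toy stand-in stemmer are all hypothetical, and 
none of it is Lucene or Snowball API. In a real analyzer chain the 
UnaryOperator would be replaced by a call into your actual stemmer.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class PreStemNormalizer {
    // Hypothetical exception list: inflected form -> base form.
    // A real list would have to be curated by hand or taken from a dictionary.
    private static final Map<String, String> EXCEPTIONS = Map.of(
        "harder", "hard",
        "hardest", "hard");

    /** Lowercases the token and replaces it with its base form if it is listed. */
    public static String normalize(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        return EXCEPTIONS.getOrDefault(lower, lower);
    }

    /** Normalizes first, then hands the token to whatever stemmer is supplied. */
    public static String normalizeThenStem(String token, UnaryOperator<String> stemmer) {
        return stemmer.apply(normalize(token));
    }

    public static void main(String[] args) {
        // Toy stand-in stemmer (NOT Snowball): strips one trailing "s" from longer words.
        UnaryOperator<String> toyStemmer =
            t -> (t.length() > 3 && t.endsWith("s")) ? t.substring(0, t.length() - 1) : t;

        for (String word : new String[] {"hard", "harder", "hardest", "cats"}) {
            System.out.println(word + " -> " + normalizeThenStem(word, toyStemmer));
        }
    }
}
```

The same normalization has to run at both index time and query time, 
otherwise a search for "harder" will not match documents indexed under 
"hard". Maintaining the exception list is the hairy part.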

Hope this helps.

Matt

-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012


