lucene-java-user mailing list archives

From Tatu Saloranta <t...@hypermall.net>
Subject Re: Phrase query and porter stemmer
Date Fri, 14 Feb 2003 03:09:15 GMT
On Thursday 13 February 2003 05:06, Mailing Lists Account wrote:
> Doug Cutting wrote:
> > Mailing Lists Account wrote:
..
> > That's because Google and most internet search engines never do any
> > stemming.
> >
> > Doug
>
> I didn't know that. Thanks.
>
> Generally speaking, are there any advantages not to apply the stemmer ?

Yes, I suspect there are.

There are two ways to think about this. The first is that Google, arguably
the best general-purpose search engine in the world at the moment, does not
use it. That in itself suggests stemming may not be very useful for general
indexing and searching, especially for phrase searches.

The second is that for internet search engines (or any search engine with a
massive amount of non-domain-specific data), stemming reduces the accuracy
of matching, and with huge data sets that is not a good thing. Instead of,
say, 100 matches, you get 10000 matches, because stemming makes terms more
general, so they match more often.
It is like trying to find a needle in a haystack, if you will.
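
To make the "more general terms match more often" point concrete, here is a
minimal sketch. It assumes a toy suffix stripper (toy_stem, a made-up helper
that is NOT the Porter algorithm and not Lucene's analyzer) and a tiny
made-up document set; it only illustrates how one query term picks up more
documents once terms are stemmed.

```python
def toy_stem(word):
    """Strip a few common English suffixes. Illustrative only, NOT Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def matching_docs(query, docs, stem):
    """Return indices of documents that contain the single query term."""
    normalize = toy_stem if stem else (lambda w: w)
    target = normalize(query.lower())
    return [
        i
        for i, doc in enumerate(docs)
        if target in {normalize(t) for t in tokenize(doc)}
    ]


# Hypothetical documents, made up for this example.
DOCS = [
    "index the documents",
    "indexing is slow",
    "indexes were rebuilt",
    "indexed fields only",
    "search the archive",
]

exact = matching_docs("indexing", DOCS, stem=False)
stemmed = matching_docs("indexing", DOCS, stem=True)
print("without stemming:", exact)
print("with stemming:   ", stemmed)
```

With this data the unstemmed query matches one document and the stemmed
query matches four. On a large collection the same effect would be expected
to scale up accordingly.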

Stemming is probably more useful for reducing the size of the index and
improving performance that way. This used to matter more, when memory and
performance limits were stricter than they are now.
Also, if you want to do semantic mapping and correlation, stemming is very
useful (especially combined with an extensive list of stop words), because
minimizing the data sets used for correlation is essential for acceptable
performance.

I think the usefulness of stop words is closely related to the usefulness of
stemming (i.e. more useful in some cases than others).

> Except for certain keywords,I found use of stemmers helpful.

I suspect this depends a lot on the keywords in question. Unifying plurals
and singulars is often helpful, but unifying words like "useful" and
"useless" is, well, not very helpful (do they both get stemmed to "use", as
I would guess, or not?). Similarly, dropping stop words like "with",
"without", "no"/"not" may result in a dramatic loss of accuracy (i.e. you
get matches on pretty much "opposite" phrases when "not" is dropped by the
analyzer).

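The stop word problem can be sketched the same way. This is a toy analyzer
and phrase matcher, not Lucene's; the stop list, the function names and the
sample sentence are all assumptions made up for illustration. It also
assumes that removed stop words leave no gap behind, so the remaining terms
become adjacent.

```python
# Hypothetical stop list; real analyzers ship their own.
STOP_WORDS = frozenset({"a", "an", "the", "is", "with", "without", "no", "not"})


def analyze(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, and drop any stop words."""
    return [t for t in text.lower().split() if t not in stop_words]


def phrase_match(query, doc, stop_words=frozenset()):
    """True if the analyzed query terms appear contiguously in the analyzed doc."""
    q = analyze(query, stop_words)
    d = analyze(doc, stop_words)
    n = len(q)
    return n > 0 and any(d[i:i + n] == q for i in range(len(d) - n + 1))


doc = "this feature is not useful"
query = "feature is useful"

keeps_not = phrase_match(query, doc)              # no stop word removal
drops_not = phrase_match(query, doc, STOP_WORDS)  # "is" and "not" removed
print("match without stop list:", keeps_not)
print("match with stop list:   ", drops_not)
```

Without the stop list the phrase does not match, since "not" sits in the
middle of it. With the stop list, both query and document reduce to
"feature useful", so a document saying the opposite is returned as a hit.
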
What do others think?

-+ Tatu +-




