lucene-solr-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Stopwords
Date Wed, 17 Mar 2010 15:48:17 GMT

On Mar 16, 2010, at 9:51 PM, blargy wrote:

> 
> I was reading "Scaling Lucene and Solr"
> (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/)
> and I came across the section StopWords. 
> 
> In there it mentioned that it's not recommended to remove stop words at index
> time. Why is this the case? Don't all the extraneous stop words bloat the
> index and lead to less relevant results? Can someone please explain this to
> me? Thanks

Yes and no.  Putting on our historian hat, stop words were often seen as contributing very
little to scores while taking up a lot of room on disk, back in the days when disk was very
precious.  Times, as they say, have changed.  Disk is cheap, so index size is no longer much
of a concern.

Think about stop words a little bit from a language perspective.  While it is true that they
are of little value in search, they are not of "no value" (if they were of no value in a
language, one could argue that the words shouldn't even exist, right?).  This is especially
true when the user enters a query that is entirely stop words (for instance, there is a band
called "The THE").  Thus, the trick becomes knowing when to use stop words and when not to.
If you remove them at indexing time, you have no choice, as the information is lost.  That is
why more and more people keep them during indexing and then deal with them at query time.

It turns out that stop words are often also useful as part of phrases.  Consider the following
two documents:

1. The President of the United States went to China last week.
2. Joe is the President.  The United States is investigating him for corruption.

If the user enters the query "The President of the United States" and stop words are removed
at both indexing and search time, then both documents will match.  With stop words kept, only
the first document matches, which is the correct result, at least based on my intent.
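
To make that concrete, here is a rough Python sketch of the two-document example.  It is a toy
model, not Lucene or Solr code: the stop word list, tokenize() and phrase_match() are all made
up for illustration, and it assumes stop words are dropped without leaving position gaps
behind (a real analyzer may be configured differently).

```python
import re

# Hypothetical stop word list, chosen only for this illustration.
STOP_WORDS = {"a", "an", "the", "of", "is", "to", "for"}

DOC1 = "The President of the United States went to China last week."
DOC2 = "Joe is the President.  The United States is investigating him for corruption."
QUERY = "The President of the United States"


def tokenize(text, remove_stop_words):
    """Lowercase, split on non-letters, optionally drop stop words.

    Assumption: removed stop words leave no position gap behind.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def phrase_match(doc, query, remove_stop_words):
    """True if the query tokens occur as one contiguous run in the document."""
    d = tokenize(doc, remove_stop_words)
    q = tokenize(query, remove_stop_words)
    n = len(q)
    return n > 0 and any(d[i:i + n] == q for i in range(len(d) - n + 1))


if __name__ == "__main__":
    for remove in (True, False):
        print("stop words removed:" if remove else "stop words kept:",
              phrase_match(DOC1, QUERY, remove),
              phrase_match(DOC2, QUERY, remove))
```

With stop words removed, the query collapses to "president united states", which also appears
as a run in the second document once "The" is dropped across the sentence boundary.  With stop
words kept, only the first document contains the full phrase.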

To deal with them at query time, you need an intelligent query parser that:
1. Recognizes when the query is all stop words
2. Keeps stop words as part of phrases

Unfortunately, none of the existing Solr Query Parsers address these two things.
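
As a rough sketch of what those two rules could look like, here is some standalone Python.
This is not an existing Solr query parser: plan_query(), the clause tuples and the stop word
list are hypothetical, and it assumes the index was built with stop words kept.

```python
import re

# Hypothetical stop word list, chosen only for this illustration.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "to", "for"}


def plan_query(raw_query):
    """Split a raw query into ("phrase", text) and ("term", text) clauses.

    Rule 1: if every clause is a stop word, keep them all.
    Rule 2: stop words inside a quoted phrase are never removed.
    """
    clauses = []
    for phrase, term in re.findall(r'"([^"]+)"|(\S+)', raw_query):
        if phrase:
            clauses.append(("phrase", phrase.lower()))
        else:
            clauses.append(("term", term.lower()))

    # Drop stop words only from the bare terms; phrases stay intact.
    content = [c for c in clauses
               if c[0] == "phrase" or c[1] not in STOP_WORDS]

    # If nothing survives, the query was all stop words, so keep it as typed.
    return content if content else clauses


if __name__ == "__main__":
    print(plan_query("The THE"))
    print(plan_query('"The President of the United States" in the news'))
    print(plan_query("history of the band"))
```

The clauses would still have to be turned into real term and phrase queries against the
index; the sketch only covers the decision about which stop words to keep.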

HTH,
Grant


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search

