lucene-solr-user mailing list archives

From: Walter Underwood
Subject: Re: Default stop word list
Date: Tue, 30 Aug 2016 01:18:41 GMT
Do not remove stop words. Want to search for “vitamin a”? That won’t work.

Stop word removal is a hack left over from when we were running search engines in 64 kbytes
of memory.

Yes, common words are less important for search, but removing them is a brute-force approach
with severe side effects. Instead, we use a proportional approach with the tf.idf model, which
puts a higher weight on rare words and a lower weight on common words.
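
As a rough illustration of that weighting idea, here is a minimal sketch. It is not the scoring formula Solr or Lucene actually uses; the toy corpus, the smoothed idf variant, and the function names are all assumptions made for the example.

```python
import math

# Toy corpus, made up for illustration; each document is a short title.
DOCS = [
    "vitamin a supplements",
    "a guide to vitamin c",
    "a history of the frisbee",
    "the best of the blues",
]
TOKENIZED = [doc.split() for doc in DOCS]


def idf(term):
    """Smoothed inverse document frequency: rarer terms get larger values."""
    df = sum(1 for tokens in TOKENIZED if term in tokens)
    return math.log((1 + len(TOKENIZED)) / (1 + df)) + 1.0


def score(query, tokens):
    """Sum of tf * idf over the query terms; no stop words are removed."""
    return sum(tokens.count(term) * idf(term) for term in query.split())


# Common words keep a small, nonzero weight instead of vanishing.
for term in ("a", "vitamin", "frisbee"):
    print(f"{term:8s} idf = {idf(term):.3f}")
```

In this sketch "a" still contributes to the score, so a query like "vitamin a" can match on both words, but the rarer word dominates the ranking.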

For some real-life examples of problems with stop words, here is a list of movie titles
that disappear entirely once stemming and stop word removal are applied. I discovered these
when I was running search at Netflix.

	• Being There (this is the first one I noticed)
	• To Be and To Have (Être et Avoir)
	• To Have and Have Not
	• Once and Again
	• To Be or Not To Be (1942) (OK, it isn’t just a quote from Hamlet)
	• To Be or Not To Be (1983)
	• Now and Then, Here and There
	• Be with Me
	• I’ll Be There
	• It Had to Be You
	• You Should Not Be Here
	• You Are Here
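
To see how a title can vanish, here is a small sketch of stop word removal. The stop list below is a hypothetical, deliberately tiny one chosen for illustration; it is not Solr's default list or the list used at Netflix, and the tokenizer is a simplification with no stemming.

```python
import re

# Hypothetical stop list for illustration only.
STOP_WORDS = {"a", "and", "be", "not", "of", "or", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(text):
    """Return the tokens that survive stop word removal."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]


for title in ("To Be or Not To Be", "vitamin a", "For a Few Dollars More"):
    print(f"{title!r} -> {remove_stop_words(title)}")
```

With this list, "To Be or Not To Be" is left with no tokens at all, so there is nothing to index or to query on, and "vitamin a" becomes indistinguishable from "vitamin".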

Walter Underwood

> On Aug 29, 2016, at 5:39 PM, Steven White wrote:
> Thanks Shawn.  This is the best answer I have seen, much appreciated.
> A follow-up question: I want to remove stop words from the list, but if I
> do, then search quality will degrade (and index size will grow, which is less
> of an issue).  For example, if I remove "a", then if someone searches for "For
> a Few Dollars More" (without quotes), chances are good that records with "a"
> that are not relevant to the user's search will land higher up.  How can I
> address this?  Can I set up my schema so that records that get hits against a
> list of words, let's say from the stop word list, are ranked lower?
> Steve
> On Sat, Aug 27, 2016 at 2:53 PM, Shawn Heisey wrote:
>> On 8/27/2016 12:39 PM, Shawn Heisey wrote:
>>> I personally think that stopword removal is more of a problem than a
>>> solution.
>> There actually is one thing that a stopword filter can do that has little
>> to do with the purpose it was designed for.  You can make it impossible
>> to search for certain words.
>> Imagine that your original data contains the word "frisbee" but for some
>> reason you do not want anybody to be able to locate results using that
>> word.  You can create a stopword list containing just "frisbee" and any
>> other variations that you want to limit like "frisbees", then place it
>> as a filter on the index side of your analysis.  With this in place,
>> searching for those terms will retrieve zero results.
>> Thanks,
>> Shawn
