lucene-solr-dev mailing list archives

From Walter Underwood <wunderw...@netflix.com>
Subject Re: default text type and stop words
Date Sat, 03 Nov 2007 00:11:15 GMT
Stopwords are fairly common in movie titles. There are even titles
made entirely of stopwords. The first one I noticed was "Being There".
I posted more of them here:

http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

wunder
==
Search Guy
Netflix

On 11/2/07 3:53 PM, "Sundling, Paul" <paul.sundling@sonyconnect.com> wrote:

> I noticed unexpected results when querying stop words, with and without
> additional conditions, using the default text type.
>  
> A normal query with a stop word returns no results as expected:
>  
> For example with 'an' being a stopword
>  
>   movieName:an (results: 0, since it's a stop word)
>   movieName:another (results: 237)
>  
>   rating:PG-13  (results: 76095)
>  
>  
> When I combine clauses with AND, a normal non-stop word like 'another'
> gives a result count less than or equal to the smaller of the counts
> being ANDed.  So an AND clause that queries a stop word should yield
> 0 results.
>  
>   rating:PG-13 AND movieName:another (results: 46)
>  
>   rating:PG-13 AND movieName:an (results: 76095, should be 0)
>   
> So instead of ANDing the stop word clause, it seems to ignore it.
> Commenting out the stop word filter in the query analyzer of the text
> type corrects this behavior, although I'm not sure that's a real
> solution.  Even if the actual problem is at the Lucene level, perhaps it
> would be worth considering changes to the default to get around it.
>  
> Workaround:
>  
>    <fieldType name="text" class="solr.TextField"
>               positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <!-- in this example, we will only use synonyms at query time
>         <filter class="solr.SynonymFilterFactory"
>                 synonyms="index_synonyms.txt" ignoreCase="true"
>                 expand="false"/>
>         -->
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>                 words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory"
>                 generateWordParts="1" generateNumberParts="1"
>                 catenateWords="1" catenateNumbers="1" catenateAll="0"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory"
>                 protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory"
>                 synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <!-- commented out to prevent strange behavior
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>                 words="stopwords.txt"/>
>         -->
>         <filter class="solr.WordDelimiterFilterFactory"
>                 generateWordParts="1" generateNumberParts="1"
>                 catenateWords="0" catenateNumbers="0" catenateAll="0"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory"
>                 protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType>
>  
> Paul Sundling
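
The behavior described in the quoted message can be illustrated with a small toy model. This is only a sketch and not Lucene or Solr code: the stop list, the three sample documents, and the helper names (`analyze`, `matches`, `search_and`) are all assumptions made up for illustration. It assumes that a clause whose query text analyzes to zero terms is dropped from the boolean query rather than treated as matching nothing, which is consistent with the counts reported above.

```python
# Toy model only: this is NOT Lucene or Solr code.
STOPWORDS = {"a", "an", "the"}  # assumed stop list, for illustration

# Hypothetical sample documents.
DOCS = [
    {"movieName": "Another Day", "rating": "PG-13"},
    {"movieName": "An Affair", "rating": "PG-13"},
    {"movieName": "Another World", "rating": "R"},
]


def analyze(text, remove_stopwords):
    """Lowercase, split on whitespace, optionally drop stop words."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def matches(doc, field, value, query_stopwords):
    """Return True/False, or None if the query text analyzed to no terms."""
    query_terms = analyze(value, query_stopwords)
    if not query_terms:
        return None  # nothing left to search for
    # The index side always removes stop words, as in the quoted config.
    doc_terms = analyze(doc[field], True)
    return all(t in doc_terms for t in query_terms)


def search_and(clauses, query_stopwords):
    """AND together (field, value) clauses, dropping empty clauses."""
    hits = []
    for doc in DOCS:
        results = [matches(doc, f, v, query_stopwords) for f, v in clauses]
        kept = [r for r in results if r is not None]  # empty clauses vanish
        if kept and all(kept):
            hits.append(doc)
    return hits


# Stop filter active at query time: the movieName:an clause disappears,
# so only rating:PG-13 constrains the result.
with_filter = search_and([("rating", "PG-13"), ("movieName", "an")], True)

# Stop filter removed at query time (the workaround): "an" is searched
# for, but it was never indexed, so nothing matches.
without_filter = search_and([("rating", "PG-13"), ("movieName", "an")], False)
```

In this model the workaround returns zero hits only because "an" was removed at index time. A title such as "Being There" would remain unfindable unless the stop filter were also removed from the index analyzer.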

