lucene-solr-dev mailing list archives

From "Sundling, Paul" <paul.sundl...@sonyconnect.com>
Subject default text type and stop words
Date Fri, 02 Nov 2007 22:53:28 GMT
I'm seeing very unexpected results when a stop word query is combined
with other conditions on a field of the default text type.
 
A normal query with a stop word returns no results as expected:
 
For example, with 'an' being a stop word:
 
  movieName:an (results: 0 since it's a stop word) 
  movieName:another (results 237)
 
  rating:PG-13  (results: 76095)
 
 
When I combine clauses with AND, a normal non-stop word such as
'another' behaves as expected: the result count is less than or equal
to the smaller of the counts being ANDed.  By the same logic, ANDing in
a stop word clause that matches nothing on its own should give 0 results.
 
  rating:PG-13 AND movieName:another (results 46)
 
  rating:PG-13 AND movieName:an (results 76095, should be 0)
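
To spell out the expectation, here is a minimal Python sketch using
made-up document IDs (the sets and the and_query helper are purely
illustrative, not taken from my index).  AND is a set intersection, so
the result can never be larger than the smallest clause, and
intersecting with an empty set must give an empty set.

```python
# Toy posting lists: each query clause maps to a set of matching document IDs.
# The IDs are invented for illustration; only the set logic matters.
rating_pg13 = {1, 2, 3, 4, 5}
movie_another = {2, 4, 9}
movie_an = set()  # a stop word query matches nothing on its own


def and_query(*clauses):
    """AND is set intersection, so the result never exceeds the smallest clause."""
    result = set(clauses[0])
    for clause in clauses[1:]:
        result &= clause
    return result


another_hits = and_query(rating_pg13, movie_another)
an_hits = and_query(rating_pg13, movie_an)

# The bound from the paragraph above: |A AND B| <= min(|A|, |B|).
assert len(another_hits) <= min(len(rating_pg13), len(movie_another))
print(len(another_hits), len(an_hits))
```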
  
So instead of ANDing the stop word clause, the query seems to ignore it.
Commenting out the stop word filter in the query analyzer of the text
type corrects this behavior, although I'm not sure that's a real
solution.  Even if the actual problem is at the Lucene level, it may be
worth changing the default configuration to get around it.
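
My guess at what is happening, sketched as a toy Python model.  This is
not Lucene or Solr code: STOP_WORDS, INDEX, analyze and search are
hypothetical names, and the document IDs are invented.  The assumption is
that the query analyzer removes the stop word, the clause is left with no
terms, and the parser silently drops the empty clause instead of treating
it as "matches nothing".

```python
# Toy model of the suspected behavior; not actual Lucene/Solr code.
STOP_WORDS = {"a", "an", "the"}  # hypothetical stop list

# Hypothetical inverted index: (field, term) -> matching document IDs.
# Stop words are never indexed, so there is no ("movieName", "an") entry.
INDEX = {
    ("rating", "pg-13"): {1, 2, 3, 4, 5},
    ("movieName", "another"): {2, 4, 9},
}


def analyze(field, text, use_stop_filter=True):
    """Lowercase the text and, for the text field only, optionally drop stop words."""
    tokens = [text.lower()]
    if field == "movieName" and use_stop_filter:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def search(clauses, use_stop_filter=True):
    """AND the clauses together, silently skipping any clause left with no tokens."""
    surviving = []
    for field, text in clauses:
        tokens = analyze(field, text, use_stop_filter)
        if not tokens:
            continue  # the clause vanishes instead of matching nothing
        docs = set.intersection(*(INDEX.get((field, t), set()) for t in tokens))
        surviving.append(docs)
    if not surviving:
        return set()  # no clauses left at all: nothing matches
    return set.intersection(*surviving)


alone = search([("movieName", "an")])
combined = search([("rating", "PG-13"), ("movieName", "an")])
workaround = search([("rating", "PG-13"), ("movieName", "an")], use_stop_filter=False)
print(len(alone), len(combined), len(workaround))
```

In this model the stop word on its own returns nothing, the combined
query returns everything matching rating:PG-13, and turning off the
query-time stop filter brings the combined query back to 0, which is the
same pattern as the counts above.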
 
Workaround:
 
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <!-- comment out to prevent strange behavior
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        -->
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
 
Paul Sundling
