lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Short DismaxRequestHandler Question
Date Fri, 07 May 2010 21:20:06 GMT

: The StopWordFilter (my implementation) removes specific types of words *and*
: all markers from all words.
: 
: This leads to a deletion of some parts of sentences.

Ah, yes, i think you're running into the same confusion people have with 
dismax and stopwords -- there was a blog post about this recently that 
explained it much better than i've ever been able to...

http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>> As long as each of those solr fields is configured for stopwords (and 
>> the same) stopwords, everything Just Works the way you’d expect.  But 
>> if one of those fields does not have stopwords configured, then 
>> (depending on your mm settings), you can easily end up getting zero 
>> hits for any (non-phrase) query clause that is a stopword.  This kind 
>> of makes sense when you think about it — since at least one field 
>> didn’t have stopwords, there was a clause included for that stopword 
>> you entered. 

(the blog post makes an incorrect assumption after that -- but the 
paragraph above is dead on)

: Let me be sure, that I have understood your part about how the
: DisMaxRequestHandler works.
: If I got 4 fields:
: name, colour, category, manufacturer
: 
: And an example-doc like this:
: title: iPhone
: colour: black
: category: smartphone
: manufacturer: apple
: 
: And I got a dismax-query like this:
: q=apple iPhone & qf=title^5 manufacturer & mm=100% 
: Than the whole thing will match (assumed that iPhone and /or apple where no
: stopwords)?

	correct

: Another example:
: title: "Solr in a production environment"
: cat: "tutorial"
: 
: At index-time, title is reduced to: "Solr production environment".
: A query like this "using Solr in a production environment"
: will be reduced to "Solr production environment".

...not necessarily.  if you only have one field in your qf, and that 
field defines "using", "in" and "a" as stopwords, then that may be what 
your query turns into.

: However, if I got a "content" field, that indexes the content of the text
: without my markerFilter, this won't work, because the parsed query-strings
: are different??? I don't understand the problem 

(FWIW: "parsed query-strings" is an ambiguous statement: it could be 
referring to the Query object you get when parsing query strings, or it 
could refer to the toString value of the Query object you get after 
parsing.)

The query string is not parsed "differently" for each of your qf fields, it 
is parsed exactly once, and each "chunk" of the string (ie: a "word" or 
quoted phrase) is passed to the analyzer for each field -- if any one of 
those fields produces a valid stream of tokens for that input (ie: it's 
not a stopword) then that constitutes one clause -- even if only one field 
says it's a valid clause, it's still a valid clause, and it's factored in 
to the "min-should-match" (mm) amount.
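
As a rough illustration of the paragraph above, here is a toy model in Python. It is not Solr code: the function names are made up for this example, the analyzers are simple lowercase/whitespace stand-ins for real analysis chains, and mm is fixed at 100%. It assumes a "title" field that removes stopwords and a "content" field that does not.

```python
# Toy model of dismax clause building; NOT Solr's implementation.
# All names here are hypothetical and exist only for this sketch.
STOPWORDS = {"using", "in", "a"}

def analyze_stopped(text):
    """Stand-in for a field whose analyzer removes stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def analyze_plain(text):
    """Stand-in for a field whose analyzer keeps every word."""
    return text.lower().split()

def build_clauses(query, analyzers):
    """Parse the query once; run each chunk through every qf field's analyzer.

    A chunk becomes a clause if at least one field produces tokens for it.
    """
    clauses = []
    for chunk in query.split():
        disjuncts = {}
        for field, analyze in analyzers.items():
            tokens = analyze(chunk)
            if tokens:
                disjuncts[field] = tokens[0]
        if disjuncts:
            clauses.append(disjuncts)
    return clauses

def index_doc(doc, analyzers):
    """Index each field of the doc with that field's own analyzer."""
    return {f: set(analyzers[f](text)) for f, text in doc.items() if f in analyzers}

def matches_mm_100(indexed, clauses):
    """mm=100%: every clause must match in at least one of its fields."""
    return all(
        any(term in indexed.get(field, set()) for field, term in clause.items())
        for clause in clauses
    )

doc = {
    "title": "Solr in a production environment",
    "content": "Solr in a production environment",
}
query = "using Solr in a production environment"

title_only = {"title": analyze_stopped}
both = {"title": analyze_stopped, "content": analyze_plain}

c1 = build_clauses(query, title_only)
c2 = build_clauses(query, both)
print(len(c1), matches_mm_100(index_doc(doc, title_only), c1))  # 3 True
print(len(c2), matches_mm_100(index_doc(doc, both), c2))        # 6 False
```

With only "title" in qf the stopwords vanish and three clauses remain, all of which match. Once "content" is added, "using", "in" and "a" each become a clause (valid only in "content"), and since the document never contains "using", mm=100% can no longer be satisfied.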

Mike Klaas explained this really well in a previous thread about stop 
words and dismax, where he showed the detailed query structure....

http://old.nabble.com/Re%3A-DisMax-request-handler-doesn%27t-work-with-stopwords--p11016770.html

...hopefully that structure will help make the behavior you are seeing 
clear.  I suggest you add debugQuery=true to your queries that are 
failing, and look closely at the parsedquery_toString -- pay attention to 
the structure, and note how many clauses exist for the main boolean query 
-- note the clauses of that query, and where you have clauses consisting 
exclusively of stopwords (in fields where stopwords are not removed).  If 
it's still not making sense please post that exact debug output.
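
For example, a request with debugging turned on could be put together like this. This is only a sketch: the host, port and path are assumptions about a default local install, and the q/qf/mm values are borrowed from the example earlier in this thread, so substitute your own.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust to wherever your Solr instance lives.
base_url = "http://localhost:8983/solr/select"

params = {
    "defType": "dismax",           # use the dismax query parser
    "q": "apple iPhone",
    "qf": "title^5 manufacturer",
    "mm": "100%",
    "debugQuery": "true",          # adds the debug section to the response
}

url = base_url + "?" + urlencode(params)
print(url)
```

Fetch that URL in a browser (or with any HTTP client) and look at the parsed query in the debug section of the response.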

-Hoss
