lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject RE: Why do documents without the search query term rank highest
Date Tue, 01 Dec 2015 21:51:42 GMT

: Again, my confusion is why the document 'Home' appears ahead of the 
: document 'Big Mac' in the ranking when the query term 'big' only appears 
: once in 'Home' but several times in 'Big Mac'?

The key to understanding how documents are scored is in the query 
structure and the "explain" output.

By default the explain output is a simple string using newlines & 
whitespace indenting for formatting -- something that got lost when you 
pasted it into email -- but i've tried to reformat it below based on 
educated guesses and lots of experience. (FWIW: adding 
debug.explain.structured=true will use the xml/json/whatever response 
format for structure instead of newlines + indenting)
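
For reference, here is a minimal sketch of building a request that asks for the structured explain output. The host, core name ("collection1"), and query string are placeholders I made up for illustration; only debug.explain.structured comes from the note above, and debugQuery / wt are the usual Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical host and core name -- substitute your own.
base_url = "http://localhost:8983/solr/collection1/select"

params = {
    "q": "big",                          # placeholder query
    "debugQuery": "true",                # ask Solr for debug/explain info
    "debug.explain.structured": "true",  # nested structure instead of indented text
    "wt": "json",                        # response writer that carries the structure
}

url = base_url + "?" + urlencode(params)
print(url)
```

With the structured form, the nesting survives copy/paste into email because it is carried by the response format rather than by whitespace.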

<str name="http://www-a4.staging.mcdonalds.com/us/en/home.html">

0.027089478 = (MATCH) product of: 
.0.18962634 = (MATCH) sum of: 
..0.18962634 = (MATCH) weight(keywords:big in 78) [DefaultSimilarity]
...0.18962634 = score(doc=78,freq=1.0 = termFreq=1.0 ), product of:
....0.3345638 = queryWeight, product of: 
.....5.18205 = idf(docFreq=3, maxDocs=262) 
.....0.06456205 = queryNorm 
....0.56678677 = fieldWeight in 78, product of: 
.....1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 
.....5.18205 = idf(docFreq=3, maxDocs=262) 
.....0.109375 = fieldNorm(doc=78) 
.0.14285715 = coord(1/7) 

What the above tells us is that the top scoring document (home.html) 
matched a single clause of the query: "keywords:big".  The *term* 
"keywords:big" appeared 1 time (freq=1.0) in this document, and is in a 
total of 3 documents (docFreq). 

(note that *term* is key here -- the number of times the *word* big 
appears in all fields doesn't matter for score calculations, just that it 
appears in the "keywords" field for a total of 3 documents, and this is 
one of them)

There were "penalties" to the score for this document based on the 
"fieldNorm" of the keywords field (which comes from index time document & 
field boosts, as well as field length at index time) and because it only 
matched 1/7 of the clauses of the query.
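
To make the arithmetic in that explain output concrete, here is a rough sketch that recomputes the home.html score from the numbers shown. It assumes the DefaultSimilarity formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which are consistent with the values in the output; the function names are my own, and the queryNorm and fieldNorm are simply copied from the explain rather than derived.

```python
import math

def tf(freq):
    # DefaultSimilarity term frequency factor: sqrt(freq)
    return math.sqrt(freq)

def idf(doc_freq, max_docs):
    # DefaultSimilarity inverse document frequency
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def clause_score(freq, doc_freq, max_docs, query_norm, field_norm):
    # score = queryWeight * fieldWeight, as in the explain tree
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm
    field_weight = tf(freq) * term_idf * field_norm
    return query_weight * field_weight

# keywords:big in home.html (doc 78), values copied from the explain output
keywords_big = clause_score(freq=1.0, doc_freq=3, max_docs=262,
                            query_norm=0.06456205, field_norm=0.109375)

# only 1 of the 7 query clauses matched, so coord = 1/7
home_score = keywords_big * (1 / 7)

print(round(keywords_big, 6), round(home_score, 6))
```

Lucene does this math in 32-bit floats, so the last digit or two may differ from what Solr prints, but the result should land very close to the 0.027089478 shown above.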

Now let's compare with the second match....

<str name="http://www-a4.staging.mcdonalds.com/us/en/our_story/replacement-to-new-search/BigMac.html">

0.0075755017 = (MATCH) product of: 
.0.026514255 = (MATCH) sum of: 
..0.0146626085 = (MATCH) weight(description:big in 104) [DefaultSimilarity]
...0.0146626085 = score(doc=104,freq=3.0 = termFreq=3.0 ), product of:
....0.3345638 = queryWeight, product of: 
.....5.18205 = idf(docFreq=3, maxDocs=262) 
.....0.06456205 = queryNorm 
....0.043826047 = fieldWeight in 104, product of: 
.....1.7320508 = tf(freq=3.0), with freq of: 3.0 = termFreq=3.0 
.....5.18205 = idf(docFreq=3, maxDocs=262) 
.....0.0048828125 = fieldNorm(doc=104) 
..0.011851646 = (MATCH) weight(title:big in 104) [DefaultSimilarity]
...0.011851646 = score(doc=104,freq=1.0 = termFreq=1.0 ), product of: 
....0.3345638 = queryWeight, product of: 
.....5.18205 = idf(docFreq=3, maxDocs=262) 
.....0.06456205 = queryNorm 
....0.035424173 = fieldWeight in 104, product of: 
.....1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 
.....5.18205 = idf(docFreq=3, maxDocs=262) 
.....0.0068359375 = fieldNorm(doc=104) 
.0.2857143 = coord(2/7)

In this case, the document matches two clauses of the query -- 
"description:big" and "title:big".  The term description:big is matched 3 
times (termFreq) in this document, and evidently exists in only 3 
documents in the index (docFreq) but the fieldNorm is penalizing the 
overall scores.  Likewise the term title:big is matched 1 time, and exists 
in only 3 documents in your index -- the fieldNorm is slightly higher 
(probably due to the shorter length of the title).  The overall score of 
the second doc is penalized for only matching 2 of the 7 clauses.

Based on what i'm seeing here, the biggest surprise i have is the fieldNorm 
values you are getting -- they don't make sense given the lengths of the 
fields you showed us in the output unless some index time document (or 
field) boosts are getting applied -- perhaps intended to "promote" the 
"home.html" page in your search results?  My guess is that some setting in 
your CMS is doing this?  maybe based on "page depth" or something like 
that?
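
One way to see why those fieldNorms look odd: with DefaultSimilarity the norm is roughly boost / sqrt(number of terms in the field), so with no boost you can back out the field length each norm would imply. This is only a back-of-the-envelope sketch; the helper name is mine, and the norm is stored in a single byte, so the results are coarse approximations at best.

```python
def implied_term_count(field_norm, boost=1.0):
    # Invert fieldNorm ~= boost / sqrt(num_terms) to estimate num_terms.
    # The stored norm is lossy (one byte), so treat this as a rough estimate.
    return (boost / field_norm) ** 2

# fieldNorm values copied from the explain output above
norms = {
    "home.html keywords": 0.109375,
    "BigMac.html description": 0.0048828125,
    "BigMac.html title": 0.0068359375,
}

implied = {name: implied_term_count(norm) for name, norm in norms.items()}

for name, count in implied.items():
    print(f"{name}: ~{count:.0f} terms if boost were 1.0")
```

Assuming no boost, the BigMac.html norms would imply a description of tens of thousands of terms and a title not far behind, which is not plausible for a page title. Something other than field length has to be pulling those norms down.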

Based on your configs, I'm guessing you're running Solr 4.2 -- So I tried 
loading up copies of those 2 documents using the config+schema you 
provided, and here are the score explanations i got...

**NOTE** Things like the docFreqs (and therefore queryWeight & 
fieldWeight) are NOT going to be comparable because my index *only* had 
those two documents ... the key here is to compare the fieldNorms below 
with the fieldNorms from the same documents in your query...


http://www-a4.staging.mcdonalds.com/us/en/home.html
0.004108005 = (MATCH) product of:
.0.028756034 = (MATCH) sum of:
..0.028756034 = (MATCH) weight(keywords:big in 0) [DefaultSimilarity],
...0.028756034 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
....0.2629123 = queryWeight, product of:
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.2629123 = queryNorm
....0.109375 = fieldWeight in 0, product of:
.....1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.109375 = fieldNorm(doc=0)
.0.14285715 = coord(1/7)


http://www-a4.staging.mcdonalds.com/us/en/our_story/replacement-to-new-search/BigMac.html
0.07352274 = (MATCH) product of:
.0.25732958 = (MATCH) sum of:
..0.14230545 = (MATCH) weight(description:big in 1) [DefaultSimilarity],
...0.14230545 = score(doc=1,freq=3.0 = termFreq=3.0), product of:
....0.2629123 = queryWeight, product of:
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.2629123 = queryNorm
....0.54126585 = fieldWeight in 1, product of:
.....1.7320508 = tf(freq=3.0), with freq of: 3.0 = termFreq=3.0
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.3125 = fieldNorm(doc=1)
..0.115024135 = (MATCH) weight(title:big in 1) [DefaultSimilarity],
...0.115024135 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
....0.2629123 = queryWeight, product of:
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.2629123 = queryNorm
....0.4375 = fieldWeight in 1, product of:
.....1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.4375 = fieldNorm(doc=1)
.0.2857143 = coord(2/7)


...the fieldNorm for "home.html" is the same, but the fieldNorm(s) for 
BigMac.html are much higher.  The only explanation I have is that your CMS 
is sending fractional "boost" values at index time for some documents 
(again -- i speculate it might be based on how "deep" the page is in your 
site, in an attempt to "promote" higher level pages)
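
As a rough sanity check on the boost theory, you can divide each fieldNorm from the original index by the fieldNorm I got for the same field in my test index. This sketch assumes the field contents and analysis were identical in both indexes, so that any difference comes from an index time boost; the variable names are just for illustration.

```python
# fieldNorm values copied from the two sets of explain output above
original_index = {
    "home.html keywords": 0.109375,
    "BigMac.html description": 0.0048828125,
    "BigMac.html title": 0.0068359375,
}
test_index = {
    "home.html keywords": 0.109375,
    "BigMac.html description": 0.3125,
    "BigMac.html title": 0.4375,
}

# ratio ~= effective index time boost, if field lengths are the same
ratios = {name: original_index[name] / test_index[name]
          for name in original_index}

for name, ratio in ratios.items():
    print(f"{name}: original/test = {ratio}")
```

Both BigMac.html fields come out at the same ratio, 1/64, while home.html comes out at 1.0. Because norms are stored in a single byte the real boost only has to be somewhere in that neighborhood, but the same factor showing up on two different fields of one document is what you would expect from a document level boost rather than from field length.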



-Hoss
http://www.lucidworks.com/
