lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Understanding the Debug explanations for Query Result Scoring/Ranking
Date Fri, 25 Jul 2014 20:51:06 GMT
The formatting is one thing, but ultimately it is just a giant expression, 
one for each document. The expression is computing the score, based on your 
chosen or default "similarity" algorithm. All the terms in the expressions 
are detailed here:

http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

Unless you dive into that math (not so bad, really, if you are motivated), 
the expressions are going to be rather opaque to you.

The long floating point numbers are mostly just the intermediate (and final) 
calculations of the math described above.

Try constructing a very simple collection of simple, contrived documents, 
like a short sentence in each, with some common terms, and then try simple 
queries to see how the expression term values change. Try computing TF, DF, 
and IDF yourself (just count the terms by hand), and compare to what debug 
gives you.
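
As a sketch of the hand computation suggested above, the Python below counts term and document frequencies over three contrived sentences and applies the tf and idf formulas that the TFIDFSimilarity page documents for DefaultSimilarity: tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The documents, the whitespace tokenization, and the function names are assumptions made for illustration. A real Solr field goes through its own analysis chain, and Lucene computes in 32-bit floats, so the last digits can differ from what debug prints.

```python
import math

# Contrived documents, one short sentence each (hypothetical data).
docs = [
    "the tv is on",
    "the tv shows tv news",
    "the radio is off",
]

def tf(freq):
    # DefaultSimilarity: square root of the raw term frequency.
    return math.sqrt(freq)

def idf(doc_freq, max_docs):
    # DefaultSimilarity: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_stats(term, documents):
    # Naive whitespace tokenization; a stand-in for Solr's analyzers.
    freqs = [doc.split().count(term) for doc in documents]
    doc_freq = sum(1 for f in freqs if f > 0)
    return freqs, doc_freq

freqs, doc_freq = term_stats("tv", docs)
print("termFreq per doc:", freqs)
print("docFreq:", doc_freq)
print("idf:", idf(doc_freq, len(docs)))
print("tf per doc:", [tf(f) for f in freqs])
```

Changing the sentences (for example, adding "tv" to the third one) and rerunning shows how docFreq and idf move together, which is the same effect to look for in the debug output.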

-- Jack Krupansky

-----Original Message----- 
From: O. Olson
Sent: Thursday, July 24, 2014 6:45 PM
To: solr-user@lucene.apache.org
Subject: Understanding the Debug explanations for Query Result 
Scoring/Ranking

Hi,

If you add &debug=true to the Solr request (and &wt=xml if your
current output is not XML), you get a node in the resulting XML that
is named "debug". It has a child node called "explain", which contains
a list showing why the results are ranked in a particular order.
I'm curious whether there is some documentation on understanding these
numbers/results.

I am new to Solr, so I apologize that I may be using the wrong terms to
describe my problem. I am also aware of
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
though I have not completely understood it.

My problem is trying to understand something like this:

1.5797625 = (MATCH) sum of: 0.4717142 = (MATCH) weight(text:televis in
44109) [DefaultSimilarity], result of: 0.4717142 = score(doc=44109,freq=1.0
= termFreq=1.0 ), product of: 0.71447384 = queryWeight, product of:
7.0424104 = idf(docFreq=896, maxDocs=377553) 0.10145303 = queryNorm 0.660226
= fieldWeight in 44109, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 =
termFreq=1.0 7.0424104 = idf(docFreq=896, maxDocs=377553) 0.09375 =
fieldNorm(doc=44109) 1.1080483 = (MATCH) weight(text:tv in 44109)
[DefaultSimilarity], result of: 1.1080483 = score(doc=44109,freq=6.0 =
termFreq=6.0 ), product of: 0.6996622 = queryWeight, product of: 6.896415 =
idf(docFreq=1037, maxDocs=377553) 0.10145303 = queryNorm 1.5836904 =
fieldWeight in 44109, product of: 2.4494898 = tf(freq=6.0), with freq of:
6.0 = termFreq=6.0 6.896415 = idf(docFreq=1037, maxDocs=377553) 0.09375 =
fieldNorm(doc=44109)

Note: I have searched for "televisions". My search field is a single
catch-all field. The Edismax parser seems to break up my search term into
"televis" and "tv".

Is there some documentation on how to understand these numbers? They do not
seem to be properly delimited. At the minimum, I can understand something
like:
1.5797625 =  0.4717142 + 1.1080483
and
0.71447384  = 7.0424104 * 0.10145303

But I cannot tell whether something like "0.10145303 = queryNorm 0.660226
= fieldWeight in 44109" is used in the calculation anywhere. Also, since
there were only two terms ("televis" and "tv"), I could use subtraction to
find out that 1.1080483 was the start of a new result.
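
One way to check how the pieces fit together is to recompute the numbers. The sketch below assumes the DefaultSimilarity formulas from the TFIDFSimilarity page (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))) and reads the flattened output as a tree in which "0.10145303 = queryNorm" is the last factor of queryWeight and "0.660226 = fieldWeight" begins the next factor of the term score. The function and constant names are hypothetical; the input values are copied from the output above. Results should agree with the debug values to about six digits, since Lucene works in 32-bit floats.

```python
import math

# Values copied from the explain output above.
MAX_DOCS = 377553
QUERY_NORM = 0.10145303
FIELD_NORM = 0.09375  # fieldNorm(doc=44109)

def idf(doc_freq, max_docs=MAX_DOCS):
    # DefaultSimilarity: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq):
    # queryWeight = idf * queryNorm
    query_weight = idf(doc_freq) * QUERY_NORM
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)
    field_weight = math.sqrt(freq) * idf(doc_freq) * FIELD_NORM
    # Per-term score = queryWeight * fieldWeight
    return query_weight * field_weight

televis = term_score(freq=1.0, doc_freq=896)
tv = term_score(freq=6.0, doc_freq=1037)
total = televis + tv  # the top-level "sum of"

print(f"televis: {televis:.7f}")
print(f"tv:      {tv:.7f}")
print(f"total:   {total:.7f}")
```

Under this reading, queryNorm does enter the calculation (through queryWeight), and each "product of:" or "sum of:" applies to the entries nested directly beneath it.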

I'd also appreciate it if someone could tell me which class dumps out the
above data. If I know it, I can edit that class to make the output a bit
more understandable for me.

Thank you,
O. O.






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Understanding-the-Debug-explanations-for-Query-Result-Scoring-Ranking-tp4149137.html
Sent from the Solr - User mailing list archive at Nabble.com. 

