lucene-java-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: BlendedTermQuery causing negative IDF?
Date Tue, 19 Apr 2016 15:33:52 GMT
Thanks, Doug, for letting us know that Lucene's BM25 avoids negative IDF values.
I didn't know that.

Markus, out of curiosity, why do you need BlendedTermQuery?
I know SynonymQuery is now part of the query parser base. I think they do similar things?

Ahmet




On Tuesday, April 19, 2016 5:33 PM, Doug Turnbull <dturnbull@opensourceconnections.com>
wrote:
Lucene's BM25 avoids negative scores for this by adding 1 inside the log
term of BM25's IDF.

Compare this:
https://github.com/apache/lucene-solr/blob/5e5fd662575105de88d8514b426bccdcb4c76948/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L71

to the Wikipedia article's BM25 IDF
https://en.wikipedia.org/wiki/Okapi_BM25
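
To make the comparison concrete, here is a minimal sketch of the two IDF variants in plain Java. The class and method names are made up for illustration and are not Lucene API; `luceneIdf` follows the formula in the linked BM25Similarity source, and `classicIdf` is the classic Okapi form without the added 1. Note that the added 1 only keeps the IDF non-negative while docFreq does not exceed docCount. With the blended statistics from Markus's explain output (docFreq=2, docCount=1) the value still goes negative.

```java
// Sketch only: class and method names are hypothetical, not part of Lucene.
final class Bm25IdfSketch {

    // IDF with 1 added inside the log, as in the linked BM25Similarity source.
    static double luceneIdf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Classic Okapi BM25 IDF, without the added 1.
    static double classicIdf(long docFreq, long docCount) {
        return Math.log((docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // Term present in every document of the field: Lucene stays positive,
        // the classic form goes negative.
        System.out.println(luceneIdf(2, 2));
        System.out.println(classicIdf(2, 2));

        // docFreq larger than docCount, as in the blended case from the thread:
        // the argument of Lucene's log drops below 1, so the IDF is negative;
        // the classic form has a negative log argument and yields NaN.
        System.out.println(luceneIdf(2, 1));
        System.out.println(classicIdf(2, 1));
    }
}
```

The two Lucene values correspond to the `idf(docFreq=2, docCount=2)` and `idf(docFreq=2, docCount=1)` lines in the explain output quoted below.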

Markus, another thing to add is that when Elasticsearch uses
BlendedTermQuery, they add a lot of invariants that must be true. For
example, the fields must share the same analyzer. You may need to research
what else happens in Elasticsearch outside BlendedTermQuery to get this
behavior to work.

Another testing philosophy point: when I do this kind of work I like to
isolate the Lucene behavior separately from the Solr behavior. I might
suggest creating a Lucene unit test to validate your assumptions around
BlendedTermQuery, just to help isolate the issues. Here are Lucene's tests
for BlendedTermQuery as a basis:

https://github.com/apache/lucene-solr/blob/5e5fd662575105de88d8514b426bccdcb4c76948/lucene/core/src/test/org/apache/lucene/search/TestBlendedTermQuery.java
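
Before writing the Lucene test, it may help to check the explain output quoted below by hand. The following is a rough sketch in plain Java with no Lucene dependency; all names are hypothetical. The tfNorm formula is the standard BM25 one and reproduces the reported values. The "max plus 0.01 times others" combination is an inference from the numbers: the reported -0.004004207 for document 1 comes out only if the running max starts at 0, so negative clause scores never become the max. That is an assumption, not something verified against the source.

```java
// Sketch only: names are hypothetical; formulas are inferred from the explain output.
final class ExplainCheckSketch {

    // BM25 term-frequency normalization using the parameters listed in the explain output.
    static double tfNorm(double freq, double k1, double b,
                         double fieldLength, double avgFieldLength) {
        return freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength));
    }

    // "max plus tieBreaker times others". The running max starts at 0 (assumption),
    // so all-negative clause scores leave max at 0.
    static double maxPlusTieBreak(double tieBreaker, double... clauseScores) {
        double sum = 0;
        double max = 0;
        for (double s : clauseScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Document 1: idf(docFreq=2, docCount=1), fieldLength=2.56, avgFieldLength=2.0
        double idf = Math.log(1 + (1 - 2 + 0.5) / (2 + 0.5));
        double clause = idf * tfNorm(1.0, 1.2, 0.75, 2.56, 2.0);
        System.out.println(clause);                               // per-clause weight
        System.out.println(maxPlusTieBreak(0.01, clause, clause)); // combined score
    }
}
```

If these numbers line up with the explain output, the negative score comes from the blended docFreq/docCount statistics rather than from the Solr layer, which narrows down what a Lucene-only test needs to cover.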









On Tue, Apr 19, 2016 at 10:16 AM Ahmet Arslan <iorixxx@yahoo.com.invalid>
wrote:

>
>
> Hi Markus,
>
> It is a known property of BM25. It produces negative scores for common
> terms.
> Most of the term-weighting models are developed for indices in which stop
> words are eliminated.
> Therefore, most of the term-weighting models have problems scoring common
> terms.
> By the way, the DFI model does a decent job of handling common terms.
>
> Ahmet
>
>
>
> On Tuesday, April 19, 2016 4:48 PM, Markus Jelsma <
> markus.jelsma@openindex.io> wrote:
> Hello,
>
> I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using
> BM25 similarity, and I have a very simple unit test to see if something is
> working at all. But to my surprise, one of the results has a negative
> score, caused by a negative IDF because docFreq is higher than docCount for
> that term on that field. Here are the test documents:
>
>     assertU(adoc("id", "1", "text", "rare term"));
>     assertU(adoc("id", "2", "text_nl", "less rare term"));
>     assertU(adoc("id", "3", "text_nl", "rarest term"));
>     assertU(commit());
>
> My query parser creates the following Lucene query:
> BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term))
> which looks fine to me. But this is what I get back when issuing
> that query on the above set of documents; the third result (id 1) is the
> one with a negative score.
>
> <result name="response" numFound="3" start="0" maxScore="0.1805489">
>   <doc>
>     <str name="id">3</str>
>     <float name="score">0.1805489</float></doc>
>   <doc>
>     <str name="id">2</str>
>     <float name="score">0.14785346</float></doc>
>   <doc>
>     <str name="id">1</str>
>     <float name="score">-0.004004207</float></doc>
> </result>
> <lst name="debug">
>   <str name="rawquerystring">{!blended fl=text,text_nl}rare term</str>
>   <str name="querystring">{!blended fl=text,text_nl}rare term</str>
>   <str name="parsedquery">BlendedTermQuery(Blended(text:rare text:term
> text_nl:rare text_nl:term))</str>
>   <str name="parsedquery_toString">Blended(text:rare text:term
> text_nl:rare text_nl:term)</str>
>   <lst name="explain">
>     <str name="3">
> 0.1805489 = max plus 0.01 times others of:
>   0.1805489 = weight(text_nl:term in 2) [], result of:
>     0.1805489 = score(doc=2,freq=1.0 = termFreq=1.0
> ), product of:
>       0.18232156 = idf(docFreq=2, docCount=2)
>       0.9902773 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.5 = avgFieldLength
>         2.56 = fieldLength
> </str>
>     <str name="2">
> 0.14785345 = max plus 0.01 times others of:
>   0.14638956 = weight(text_nl:rare in 1) [], result of:
>     0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
> ), product of:
>       0.18232156 = idf(docFreq=2, docCount=2)
>       0.8029196 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.5 = avgFieldLength
>         4.0 = fieldLength
>   0.14638956 = weight(text_nl:term in 1) [], result of:
>     0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
> ), product of:
>       0.18232156 = idf(docFreq=2, docCount=2)
>       0.8029196 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.5 = avgFieldLength
>         4.0 = fieldLength
> </str>
>     <str name="1">
> -0.004004207 = max plus 0.01 times others of:
>   -0.20021036 = weight(text:rare in 0) [], result of:
>     -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
> ), product of:
>       -0.22314355 = idf(docFreq=2, docCount=1)
>       0.89722675 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.0 = avgFieldLength
>         2.56 = fieldLength
>   -0.20021036 = weight(text:term in 0) [], result of:
>     -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
> ), product of:
>       -0.22314355 = idf(docFreq=2, docCount=1)
>       0.89722675 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.0 = avgFieldLength
>         2.56 = fieldLength
> </str>
>
> What am I doing wrong? Or did I catch a bug?
>
> Thanks,
> Markus
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
