lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6687) MLT term frequency calculation bug
Date Thu, 02 May 2019 22:07:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832027#comment-16832027 ]

Michael McCandless commented on LUCENE-6687:
--------------------------------------------

OK thanks [~teofili] – I'll backport this soon.

> MLT term frequency calculation bug
> ----------------------------------
>
>                 Key: LUCENE-6687
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6687
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring, core/queryparser
>    Affects Versions: 5.2.1, 6.0
>         Environment: OS X v10.10.4; Solr 5.2.1
>            Reporter: Marko Bonaci
>            Assignee: Tommaso Teofili
>            Priority: Major
>             Fix For: 5.2.2, master (9.0)
>
>         Attachments: LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch,
buggy-method-usage.png, solr-mlt-tf-doubling-bug-results.png, solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png,
solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, solr-mlt-tf-doubling-bug.png, terms-accumulator.png,
terms-angry.png, terms-glass.png, terms-how.png
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In {{org.apache.lucene.queries.mlt.MoreLikeThis}}, there's a method {{retrieveTerms}}
that receives a {{Map}} of fields, which is basically a document, although it doesn't have to be an
existing one.
> !solr-mlt-tf-doubling-bug.png|height=500!
> There are 2 for loops, one inside the other, which both loop through the same set of
fields.
> That effectively doubles the term frequency of every term in the fields that we provide
in the MLT QP {{qf}} parameter.
> It basically goes two times over the list of fields and accumulates the term frequencies
from all fields into {{termFreqMap}}.
> The private method {{retrieveTerms}} is only called from one public method, the overload
of {{like}} that receives a {{Map}}, so the private class member {{fieldNames}}
is always derived from {{retrieveTerms}}'s argument {{fields}}.
>  
> Uh, I don't understand what I wrote myself, but that basically means that, by the time
the {{retrieveTerms}} method gets called, its parameter {{fields}} and the private member {{fieldNames}}
always contain the same list of fields.
> Here's the proof:
> These are the final results of the calculation:
> !solr-mlt-tf-doubling-bug-results.png|height=700!
> And this is the actual {{thread_id:TID0009}} document, where those values were derived
from (from fields {{title_mlt}} and {{pagetext_mlt}}):
> !terms-glass.png|height=100!
> !terms-angry.png|height=100!
> !terms-how.png|height=100!
> !terms-accumulator.png|height=100!
> Now, let's further test this hypothesis by seeing MLT QP in action from the AdminUI.
> Let's try to find docs that are More Like doc {{TID0009}}. 
> Here's the interesting part, the query:
> {code}
> q={!mlt qf=pagetext_mlt,title_mlt mintf=14 mindf=2 minwl=3 maxwl=15}TID0009
> {code}
> We just saw, in the last image above, that the term {{accumulator}} appears {{7}} times in
the {{TID0009}} doc, but its TF was calculated as {{14}}.
> By using {{mintf=14}}, we say that, when calculating similarity, we don't want to consider
terms that appear fewer than 14 times (when terms from fields {{title_mlt}} and {{pagetext_mlt}}
are merged together) in {{TID0009}}.
> I added the term {{accumulator}} to only one other document ({{TID0004}}), where it appears
only once, in the field {{title_mlt}}.
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png|height=500!
> Let's see what happens when we use {{mintf=15}}:
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png|height=500!
> I should probably mention that multiple fields ({{qf}}) work because I applied the patch:
[SOLR-7143|https://issues.apache.org/jira/browse/SOLR-7143].
> Bug, no?
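
The double counting described in the quoted report can be sketched roughly as follows. This is a simplified illustration, not the actual Lucene source: the class {{MltTfSketch}}, the methods {{nestedLoops}} and {{singleLoop}}, the whitespace tokenizer, and the split of the 7 occurrences between the two fields are all assumptions made for the example. {{singleLoop}} is only one possible way to avoid the repeated pass and is not necessarily what the attached patch does.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MltTfSketch {

    // Hypothetical stand-in for analysis: count whitespace-separated tokens.
    public static void addTermFrequencies(String text, Map<String, Integer> termFreqs) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                termFreqs.merge(token, 1, Integer::sum);
            }
        }
    }

    // Shape described in the report: the outer loop walks fieldNames and the
    // inner loop walks every field of the document again, so each value is
    // counted once per entry in fieldNames.
    public static Map<String, Integer> nestedLoops(List<String> fieldNames,
                                                   Map<String, List<String>> fields) {
        Map<String, Integer> termFreqs = new HashMap<>();
        for (String fieldName : fieldNames) {
            for (String field : fields.keySet()) {
                for (String value : fields.get(field)) {
                    addTermFrequencies(value, termFreqs);
                }
            }
        }
        return termFreqs;
    }

    // One possible alternative (an assumption): visit each field exactly once.
    public static Map<String, Integer> singleLoop(List<String> fieldNames,
                                                  Map<String, List<String>> fields) {
        Map<String, Integer> termFreqs = new HashMap<>();
        for (String fieldName : fieldNames) {
            List<String> values = fields.get(fieldName);
            if (values == null) {
                continue;
            }
            for (String value : values) {
                addTermFrequencies(value, termFreqs);
            }
        }
        return termFreqs;
    }

    public static void main(String[] args) {
        // Made-up document: "accumulator" appears 7 times across two fields.
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("title_mlt", List.of("accumulator"));
        doc.put("pagetext_mlt", List.of(
            "accumulator accumulator accumulator accumulator accumulator accumulator"));
        List<String> fieldNames = List.of("title_mlt", "pagetext_mlt");

        System.out.println(nestedLoops(fieldNames, doc).get("accumulator")); // prints 14
        System.out.println(singleLoop(fieldNames, doc).get("accumulator"));  // prints 7
    }
}
```

With two fields, the nested version reports 14 for a term that occurs 7 times, so a threshold of {{mintf=14}} is met by the inflated count even though the true count is 7. In this sketch the inflation factor equals the number of entries in {{fieldNames}}, which is why it shows up as a doubling when two fields are given.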



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

