lucene-dev mailing list archives

From "Paul Elschot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7580) Spans tree scoring
Date Sun, 04 Dec 2016 14:58:58 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15720077#comment-15720077 ]

Paul Elschot commented on LUCENE-7580:
--------------------------------------

What SpansTreeQuery does not do, and some rough edges:

The SpansDocScorer objects do the match recording and scoring, and there is one for each Spans.
These SpansDocScorer objects might be merged into their Spans to reduce the number of objects.
Related: how to deal with the same term occurring in more than one subquery? See also LUCENE-7398.

Normally the term frequency score has a diminishing contribution for each extra occurrence.
In the patch the slop factors for a term are applied in decreasing order to these diminishing
contributions, so the largest slop factor weights the largest contribution.
This requires sorting the slop factors.
The sorting could be avoided if an actual score for a single term occurrence
were available.
In that case the given slop factor could be used directly as a weight on that score.
It might be possible to estimate an actual score for a single term occurrence
from the distances to other occurrences of the same term.
Similarly, the decreasing term frequency contributions can be seen as a proximity weighting
for the same term (or subquery):
the closer a term occurs to itself, the smaller its contribution.
This might be refined by using the actual distances to the other term occurrences (or subquery
occurrences) to provide a weight for each term occurrence.
This is unusual because the weight decreases for smaller distances.
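
A minimal sketch of the sorted slop factor idea described above. This is not code from the
patch. The class name, the BM25-like saturation function and the constant K1 are assumptions
chosen only for illustration. Each extra occurrence adds the increment of a saturating term
frequency function, and the slop factors are applied to these increments from largest to smallest.

```java
import java.util.Arrays;

/** Illustrative only, not taken from the LUCENE-7580 patch. */
final class SlopWeightedTermFreq {
    // Hypothetical saturation constant, in the style of BM25's k1.
    static final double K1 = 1.2;

    /** Saturating term frequency score: grows with tf, but by less for each extra occurrence. */
    static double saturation(int tf) {
        return tf / (tf + K1);
    }

    /**
     * Combines one slop factor per matching term occurrence with the
     * diminishing contribution of each extra occurrence.
     */
    static double score(double[] slopFactors) {
        double[] sorted = slopFactors.clone();
        Arrays.sort(sorted); // ascending, so walk it from the end
        double total = 0.0;
        for (int i = 0; i < sorted.length; i++) {
            double factor = sorted[sorted.length - 1 - i];         // i-th largest slop factor
            double increment = saturation(i + 1) - saturation(i);  // contribution of occurrence i+1
            total += factor * increment;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three occurrences of one term, each matched with a different slop factor.
        System.out.println(score(new double[] {0.5, 1.0, 0.25}));
    }
}
```

With all slop factors equal to 1 the result reduces to the plain saturated term frequency,
and the order in which the occurrences were recorded does not affect the result.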

The slop factor from the Similarity may need to be adapted because of the way it is combined
here with diminishing term contributions.

Another use of a score of each term occurrence could be to use the absolute term position
to influence the score, possibly in combination with the field length.

There is an assert in TermSpansDocScorer.docScore() that verifies that
the smallest occurring slop factor is at least as large as the non matching slop factor.
This condition is necessary for consistency.
Instead of using this assert, this condition might be enforced by somehow
automatically determining the non matching slop factor.

This is a prototype. No profiling has been done. It will take more CPU time, but I have no idea
how much.
The sorting of the slop factors per matching term occurrence has roughly the same
time complexity as the position priority queues used for SpanOr and SpanNear.
Garbage collection might be affected by the reference cycles between the SpansDocScorers
and their Spans.

Since this allows weighting of subqueries, it might be possible to implement synonym scoring
in SpanOrQuery by providing good subweights, and wrapping the whole thing in SpansTreeQuery.
The only thing that might still be needed then is a SpansDocScorer that applies SimScorer.score()
to the total term frequency of the synonyms in a document.
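
A rough sketch of why scoring over the total term frequency differs from scoring each synonym
separately. The saturation function and K1 are assumptions standing in for whatever
SimScorer.score() computes; nothing here is taken from the patch, and subweights are left out.

```java
/** Illustrative only: compares two ways of combining synonym frequencies. */
final class SynonymFreqScore {
    // Hypothetical saturation constant, in the style of BM25's k1.
    static final double K1 = 1.2;

    /** Stand-in for a similarity score that saturates with frequency. */
    static double saturation(double freq) {
        return freq / (freq + K1);
    }

    /** Each synonym is scored on its own and the scores are added. */
    static double scoreSeparately(int[] synonymFreqs) {
        double total = 0.0;
        for (int freq : synonymFreqs) {
            total += saturation(freq);
        }
        return total;
    }

    /** The synonym frequencies are added first, then scored once. */
    static double scoreTogether(int[] synonymFreqs) {
        int totalFreq = 0;
        for (int freq : synonymFreqs) {
            totalFreq += freq;
        }
        return saturation(totalFreq);
    }

    public static void main(String[] args) {
        int[] freqs = {2, 1}; // two synonyms in one document
        System.out.println("separately: " + scoreSeparately(freqs));
        System.out.println("together:   " + scoreTogether(freqs));
    }
}
```

Scoring the synonyms together treats them as one term, so a document does not gain extra
score merely because its occurrences are spread over several synonyms.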

SpansTreeScorer multiplies the slop factor for nested near queries at each level.
Alternatively a minimum distance could be passed down.
This would require changing recordMatch(float slopFactor) to recordMatch(int minDistance).
Would minDistance make sense, or is there a better distance?
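
A small sketch contrasting the two options for nested near queries. The slop factor here has
the 1 / (distance + 1) shape used by ClassicSimilarity.sloppyFreq(). Taking minDistance as the
smallest distance over the nesting levels is only one possible reading and is an assumption
of this sketch, as are the class and method names.

```java
/** Illustrative only: two ways to combine distances of nested near queries. */
final class NestedSlopFactor {

    /** Slop factor of the form 1 / (distance + 1). */
    static float slopFactor(int distance) {
        return 1.0f / (1.0f + distance);
    }

    /** Multiplies the slop factors of all nesting levels. */
    static float multiplied(int[] distancePerLevel) {
        float factor = 1.0f;
        for (int distance : distancePerLevel) {
            factor *= slopFactor(distance);
        }
        return factor;
    }

    /** Assumed alternative: pass down the smallest distance and apply the slop factor once. */
    static float fromMinDistance(int[] distancePerLevel) {
        if (distancePerLevel.length == 0) {
            throw new IllegalArgumentException("at least one nesting level is required");
        }
        int minDistance = distancePerLevel[0];
        for (int distance : distancePerLevel) {
            minDistance = Math.min(minDistance, distance);
        }
        return slopFactor(minDistance);
    }

    public static void main(String[] args) {
        int[] distances = {1, 3}; // outer near query matched at distance 1, inner at distance 3
        System.out.println("multiplied:  " + multiplied(distances));
        System.out.println("minDistance: " + fromMinDistance(distances));
    }
}
```

Multiplying penalizes every level of nesting, while a single distance passed down applies the
penalty only once, so deeply nested queries score higher under the second option.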

What is a good way to test whether the score values from SpansTreeQuery actually improve on
the score values from the current SpanScorer?

There are no tests for SpanFirstQuery/SpanContainingQuery/SpanWithinQuery.
These tests are not there because these queries provide FilterSpans and that is already supported
for SpanNotQuery.

The explain() method is not implemented for SpansTreeQuery.
This should be doable with an explain() method added to SpansTreeScorer to provide the explanations.

There is no support for PayloadSpanQuery.
PayloadSpanQuery is not in here because it is not in the core module.
I think it can fit in here because PayloadSpanQuery also scores per matching term occurrence.
Then Spans.doStartCurrentDoc() and Spans.doCurrentSpans() could be removed.

In case this is acceptable as a good way to score Spans:
Spans.width() and Scorer.freq() and SpansDocScorer.docMatchFreq() might be removed.
Would it make sense to implement child Scorers in the tree of SpansDocScorer objects?


> Spans tree scoring
> ------------------
>
>                 Key: LUCENE-7580
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7580
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: master (7.0)
>            Reporter: Paul Elschot
>            Priority: Minor
>             Fix For: 6.x
>
>         Attachments: LUCENE-7580.patch
>
>
> Recurse the spans tree to compose a score based on the type of subqueries and what matched



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

