lucene-dev mailing list archives

From "Paul Elschot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7580) Spans tree scoring
Date Sun, 04 Dec 2016 14:53:58 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15720066#comment-15720066
] 

Paul Elschot commented on LUCENE-7580:
--------------------------------------

In the patch, SpansTreeQuery is a wrapper for SpanQuery that scores in basically the same
way as other queries.
When all term occurrences match at the top level or at distance 0, the score is the same as
the score for a boolean OR over the terms, independently of the Similarity that is used.
SpansTreeScorer scores each matching occurrence of a query term, and it applies discounts
for non-matching terms and for distance matches. It also uses the weights of subqueries.

The matching occurrences are recorded per document in the spans tree at each top-level match
of a document.
For each match, SpansTreeScorer descends the tree down to the leaf level of the terms of
that match.
SpansDocScorer objects are used as the tree nodes; there is one for each supported Spans.

Each matching term occurrence is recorded with a slop factor.
At the top level this slop factor is normally 1, and at each nested span near level
it is multiplied by the slop factor of the match at that level.
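
As a rough illustration of how the slop factor accumulates through nesting, the sketch
below multiplies one factor per span near level, starting from 1 at the top. It assumes
that the factor for a match distance d is 1/(d+1), which is what
ClassicSimilarity.sloppyFreq computes; the patch may use a different function. The class
and method names are hypothetical and are not taken from the patch.

```java
public class NestedSlopFactor {
    // Assumed slop factor for one match distance: 1 / (distance + 1).
    static double slopFactor(int distance) {
        return 1.0 / (distance + 1);
    }

    // Multiply the factors from the top-level match down to the term.
    // distances[i] is the match distance at span near nesting level i.
    static double nestedSlopFactor(int... distances) {
        double factor = 1.0; // the top level starts at 1
        for (int d : distances) {
            factor *= slopFactor(d);
        }
        return factor;
    }

    public static void main(String[] args) {
        System.out.println(nestedSlopFactor());     // no span near level at all
        System.out.println(nestedSlopFactor(0, 0)); // exact matches at both levels
        System.out.println(nestedSlopFactor(1, 3)); // (1/2) * (1/4)
    }
}
```

With this assumed function, a term occurrence that matches at distance 0 on every level
keeps a slop factor of 1, so it is scored like a top-level match.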

The term frequency scoring from the Similarity is applied per matching term occurrence,
and these term occurrence scores are weighted by the slop factors, taken in decreasing order.
The purpose of using the given slop factors in decreasing order is to provide scoring consistency
between span near queries that only differ in the maximum allowed slop.
This consistency requires that an extra match with a lower slop increases the score of the
document.
I would expect scoring to be consistent this way, but I'm not 100% sure.
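
One possible reading of this weighting, given only as a sketch: the n-th occurrence
contributes the increment tf(n) - tf(n-1) of a saturating term frequency score, and the
n-th largest slop factor weights that increment. Both the pairing of increments with
sorted factors and the tfScore function below are assumptions made for illustration;
the patch itself may combine them differently, and all names are hypothetical.

```java
import java.util.Arrays;

public class SortedSlopScore {
    // Hypothetical saturating term frequency score, standing in for the Similarity.
    static double tfScore(int freq) {
        return freq / (freq + 1.2);
    }

    // Weight the n-th score increment by the n-th largest slop factor.
    static double score(double[] slopFactors) {
        double[] sorted = slopFactors.clone();
        Arrays.sort(sorted); // ascending; read from the end for decreasing order
        double total = 0.0;
        for (int i = 0; i < sorted.length; i++) {
            double factor = sorted[sorted.length - 1 - i];
            total += factor * (tfScore(i + 1) - tfScore(i));
        }
        return total;
    }

    public static void main(String[] args) {
        // All factors 1: the increments add up to the plain term frequency score.
        System.out.println(score(new double[] {1.0, 1.0, 1.0}) + " vs " + tfScore(3));
        // One extra match with a smaller slop factor.
        System.out.println(score(new double[] {1.0, 0.5}));
        System.out.println(score(new double[] {1.0, 0.5, 0.25}));
    }
}
```

Under these assumptions, occurrences that all have slop factor 1 reproduce the normal
term frequency score, and adding a match with a smaller slop factor still adds a positive
amount, which is the consistency property described above.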

The non-matching term occurrences get a score that is the difference between
the normal document term frequency score and the term frequency score for the matching terms.
This non-matching score is weighted by the slop factor of a non-matching distance,
which is a parameter that must be provided.
This distance can, for example, be chosen a little larger
than the largest distance used in the wrapped span near queries.
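
The non-matching part can be sketched in the same way. The example below assumes a
hypothetical saturating tfScore function, a slop factor of 1/(distance+1), and that
"the term frequency score for the matching terms" is the plain score of the number of
matching occurrences. None of these names or choices are taken from the patch.

```java
public class NonMatchingScore {
    // Hypothetical saturating term frequency score, standing in for the Similarity.
    static double tfScore(int freq) {
        return freq / (freq + 1.2);
    }

    // Assumed slop factor for a distance: 1 / (distance + 1).
    static double slopFactor(int distance) {
        return 1.0 / (distance + 1);
    }

    // Score for the occurrences of a term that did not take part in any match.
    static double nonMatchingScore(int docFreq, int matchingFreq, int nonMatchingDistance) {
        double difference = tfScore(docFreq) - tfScore(matchingFreq);
        return difference * slopFactor(nonMatchingDistance);
    }

    public static void main(String[] args) {
        // Term occurs 5 times in the document, 2 occurrences matched,
        // and the non-matching distance is chosen as 10.
        System.out.println(nonMatchingScore(5, 2, 10));
        // All occurrences matched: nothing is left to discount.
        System.out.println(nonMatchingScore(5, 5, 10));
    }
}
```

A larger non-matching distance gives a smaller factor, so non-matching occurrences
contribute less than any occurrence that matched within the allowed slop.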

SpansTreeQuery is implemented for any combination of
SpanNearQuery, SpanOrQuery, SpanTermQuery, SpanBoostQuery,
SpanNotQuery, SpanFirstQuery, SpanContainingQuery and SpanWithinQuery.

See the javadocs and the test code for how to use SpansTreeQuery.


> Spans tree scoring
> ------------------
>
>                 Key: LUCENE-7580
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7580
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: master (7.0)
>            Reporter: Paul Elschot
>            Priority: Minor
>             Fix For: 6.x
>
>         Attachments: LUCENE-7580.patch
>
>
> Recurse the spans tree to compose a score based on the type of subqueries and what matched



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


