lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stefan Pohl (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4571) speedup disjunction with minShouldMatch
Date Mon, 26 Nov 2012 21:50:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504158#comment-13504158
] 

Stefan Pohl commented on LUCENE-4571:
-------------------------------------

Mikhail, Robert, thank you for following up on this and I'm happy to provide more details
on ideas of how to use the minimumMatch constraint to speed up the execution of disjunctive
queries.

Currently, one heap is used in disjunctive scorers to keep track of the next docid (next candidate),
and only then these candidates are ruled out if below minimum number of terms / percentage
of terms. If there is only one stop word or otherwise high docfreq term in the query, this
will produce a huge number of candidates that only occur in very few query terms and only
afterwards will be ruled out by the constraint again. As Robert pointed out, this leads to
long processing times to attain the next valid candidate. Using BooleanScorer will probably
only give improvement by a factor (in some cases), but also this scorers would still generate
all these useless candidates.
It would be much more efficient to actively use the constraint as part of query processing
and employ inverted list skipping (leap-frogging), which makes conjunctive queries so efficient,
also for these kind of queries.

An efficient implementation could look like the following:
Assume at least 5 out of 20 query terms have to match and imagine the terms/inverted lists
to be sorted by the next docid. Then, the first potential candidate that satisfies the constraint
would be the docid at position 5 and all of the inverted list at position 0-4 can be safely
advanced and have to match to that docid in order to generate a real candidate worth scoring.
The order, in which to try advancing terms 0-4 should probably be guided by the sparsity of
the lists (guided by either docfreq, if available, or a heuristic such as the distance to
the next docid), that is, inverted list 4 should first be advanced to the docid, then list
3,2,1, if previous advances are successful. Otherwise, the algorithm can start all over and
try the next docid at 5th position. This leads to effective leap-frogging on the most sparse
lists within the constraint which rules out matches not satisfying the constraint, most of
the time without even touching the very high-frequency terms. Improvements all depend on the
type of queries and number of docs per index.

There are different implementations possible to achieve such a behavior:
1) Using the current heap and pull out the first 5 lists to get to the next candidate docid
to which the other 4 then have to be advanced. This probably leads to more heap removal and
addition operations than necessary with the next approaches.
2) Always keep the first 5 lists fully sorted and use the heap to keep track of the next docid
within lists 6-20. This invariant should simplify implementation and be competitive for small
minimumMatch.
3) Use two heaps: A max-heap to keep track of the largest docid within the first 5 inverted
lists and a min-heap to keep track of the smallest docid within lists 6-20. This approach
looks most efficient.

These implementations look like generalizations to pure disjunctive queries, but if there
should be any impact they could be implemented as specializations used only for queries with
a minimum match constraint.
                
> speedup disjunction with minShouldMatch 
> ----------------------------------------
>
>                 Key: LUCENE-4571
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4571
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 4.1
>            Reporter: Mikhail Khludnev
>
> even minShouldMatch is supplied to DisjunctionSumScorer it enumerates whole disjunction,
and verifies minShouldMatch condition [on every doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]:
> {code}
>   public int nextDoc() throws IOException {
>     assert doc != NO_MORE_DOCS;
>     while(true) {
>       while (subScorers[0].docID() == doc) {
>         if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
>           heapAdjust(0);
>         } else {
>           heapRemoveRoot();
>           if (numScorers < minimumNrMatchers) {
>             return doc = NO_MORE_DOCS;
>           }
>         }
>       }
>       afterNext();
>       if (nrMatchers >= minimumNrMatchers) {
>         break;
>       }
>     }
>     
>     return doc;
>   }
> {code}
> [~spo] proposes (as well as I get it) to pop nrMatchers-1 scorers from the heap first,
and then push them back advancing behind that top doc. For me the question no.1 is there a
performance test for minShouldMatch constrained disjunction. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message