lucene-dev mailing list archives

From "Stefan Pohl (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4571) speedup disjunction with minShouldMatch
Date Mon, 11 Mar 2013 23:33:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599481#comment-13599481 ]

Stefan Pohl commented on LUCENE-4571:
-------------------------------------

Awesome co-op. Thanks, Robert & Mike, for picking this up.

One comment on 'deferring scoring': I don't know all the current use-cases for these Scorers,
but if some of them require only matching, then it is probably most efficient to have a
specialization of each Scorer for either matching only or matching plus scoring. Separating
matching from scoring within the Scorers is an orthogonal consideration, e.g. to avoid having
to maintain such separate specializations.
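
To illustrate the second option, here is a minimal sketch of a scorer that matches eagerly but
computes scores only on demand. It is not Lucene's Scorer API: the class LazyScorer, the
scoreCalls counter and the scoring function passed to the constructor are all hypothetical
names introduced for illustration.

```java
import java.util.function.IntToDoubleFunction;

final class DeferredScoringSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical scorer: matching walks a sorted doc ID array, scoring happens lazily. */
    static final class LazyScorer {
        private final int[] docs;                   // sorted, strictly increasing doc IDs
        private final IntToDoubleFunction scoreFn;  // assumed (possibly expensive) score function
        private int pos = -1;                       // -1 means "not yet positioned"
        private boolean scored;                     // true once the current doc has been scored
        private double cachedScore;
        int scoreCalls;                             // how often scoring work was actually done

        LazyScorer(int[] docs, IntToDoubleFunction scoreFn) {
            this.docs = docs;
            this.scoreFn = scoreFn;
        }

        int docID() {
            return pos < 0 ? -1 : (pos < docs.length ? docs[pos] : NO_MORE_DOCS);
        }

        /** Matching only: moves to the next doc and invalidates any cached score. */
        int nextDoc() {
            pos++;
            scored = false;
            return docID();
        }

        /** Scoring on demand; only valid while positioned on a real doc. */
        double score() {
            if (!scored) {
                cachedScore = scoreFn.applyAsDouble(docs[pos]);
                scored = true;
                scoreCalls++;
            }
            return cachedScore;
        }
    }

    public static void main(String[] args) {
        LazyScorer s = new LazyScorer(new int[] {2, 5, 9}, doc -> 1.0 / doc);
        int matches = 0;
        while (s.nextDoc() != NO_MORE_DOCS) {
            matches++; // a match-only consumer never calls score()
        }
        System.out.println(matches + " matches, " + s.scoreCalls + " score computations");
    }
}
```

A match-only consumer pays nothing for scoring here, while a scoring consumer pays once per
collected doc, so a single class can serve both use-cases.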

If you're just after saving some cycles to avoid a minor slowdown for some queries, then
deferring scoring won't help the optimized MinShouldMatchScorer as much as it helped the
previous implementation. The new scorer generates (and scores) far fewer candidates, each of
which is now much more likely to pass the MinShouldMatch constraint, so most of them will be
scored anyway (in use-cases where scoring is required). This is probably what you mean by
'this is not helpful to do if you are scoring'?

It would be awesome to have that cost API for (sub-)Scorers, as most Scorers can be rewritten
to benefit from it (wow, you could even demonstrate this for conjunctive queries). It also
allows some optimizations to work with structured queries that would otherwise be restricted
to flat bag-of-TermScorers queries.
I second rewriting the attached new MinShouldMatchScorer to use the cost API, that is, always
excluding the very same most costly subScorers and heap-merging only the remaining ones. This
would save quite a few heap operations and also simplify the implementation. It probably
amounts to the desired ~15% response time improvement for queries with only mildly restrictive
mm constraints, so that the new scorer convincingly supersedes the previous MinShouldMatchScorer
implementation.
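
The idea can be sketched as follows. This is an illustration under stated assumptions, not the
attached patch: Postings is a hypothetical stand-in for a sub-Scorer over a sorted doc ID array,
its cost() is simply the number of postings, and the number of excluded sub-scorers is assumed
to be mm-1. With mm-1 sub-scorers set aside, every doc that matches at least mm clauses must
match at least one of the remaining ones, so only those need to be heap-merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class MinShouldMatchSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical sub-scorer: a cursor over sorted, strictly increasing doc IDs. */
    static final class Postings {
        final int[] docs;
        int pos = -1; // -1 means "not yet positioned"

        Postings(int... docs) { this.docs = docs; }

        long cost() { return docs.length; } // assumed cost estimate: number of postings

        int docID() {
            return pos < 0 ? -1 : (pos < docs.length ? docs[pos] : NO_MORE_DOCS);
        }

        int nextDoc() { pos++; return docID(); }

        /** Moves to the first doc >= target (or stays put if already there). */
        int advance(int target) {
            if (pos < 0) pos = 0;
            while (pos < docs.length && docs[pos] < target) pos++;
            return docID();
        }
    }

    /** Returns all docs matched by at least mm of the given sub-scorers. */
    static List<Integer> match(int mm, Postings... subs) {
        List<Integer> hits = new ArrayList<>();
        Postings[] byCost = subs.clone();
        Arrays.sort(byCost, Comparator.comparingLong(Postings::cost));
        int numLead = byCost.length - (mm - 1); // the mm-1 most costly ones are excluded
        if (mm < 1 || numLead < 1) return hits;

        // Heap-merge only the cheaper "lead" sub-scorers, ordered by current doc ID.
        PriorityQueue<Postings> lead =
            new PriorityQueue<>(Comparator.comparingInt(Postings::docID));
        for (int i = 0; i < numLead; i++) {
            if (byCost[i].nextDoc() != NO_MORE_DOCS) lead.add(byCost[i]);
        }

        while (!lead.isEmpty()) {
            int doc = lead.peek().docID(); // next candidate
            int matches = 0;
            // Count the lead sub-scorers on this doc and move them past it.
            while (!lead.isEmpty() && lead.peek().docID() == doc) {
                Postings p = lead.poll();
                matches++;
                if (p.nextDoc() != NO_MORE_DOCS) lead.add(p);
            }
            // Consult the excluded, costly sub-scorers only while the outcome is undecided.
            for (int i = numLead; i < byCost.length && matches < mm; i++) {
                if (byCost[i].advance(doc) == doc) matches++;
            }
            if (matches >= mm) hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Integer> hits = match(3,
            new Postings(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
            new Postings(2, 4, 6, 8, 10),
            new Postings(3, 6, 9),
            new Postings(6, 10));
        System.out.println(hits); // prints [6, 10]
    }
}
```

In this example only the two cheapest lists are merged, giving four candidates (3, 6, 9, 10)
instead of the ten docs of the full disjunction, and the two costly lists are only ever
advanced, never put on the heap.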

Looking forward to seeing the impact of this optimized MinShouldMatchScorer on the runtimes of
use-cases such as:
http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
                
> speedup disjunction with minShouldMatch 
> ----------------------------------------
>
>                 Key: LUCENE-4571
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4571
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 4.1
>            Reporter: Mikhail Khludnev
>         Attachments: LUCENE-4571.patch, LUCENE-4571.patch, LUCENE-4571.patch
>
>
> Even when minShouldMatch is supplied to DisjunctionSumScorer, it enumerates the whole disjunction
> and verifies the minShouldMatch condition [on every doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]:
> {code}
>   public int nextDoc() throws IOException {
>     assert doc != NO_MORE_DOCS;
>     while(true) {
>       while (subScorers[0].docID() == doc) {
>         if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
>           heapAdjust(0);
>         } else {
>           heapRemoveRoot();
>           if (numScorers < minimumNrMatchers) {
>             return doc = NO_MORE_DOCS;
>           }
>         }
>       }
>       afterNext();
>       if (nrMatchers >= minimumNrMatchers) {
>         break;
>       }
>     }
>     
>     return doc;
>   }
> {code}
> [~spo] proposes (as far as I understand it) to pop nrMatchers-1 scorers from the heap first,
> and then push them back, advancing them beyond that top doc. For me, question no. 1 is whether
> there is a performance test for minShouldMatch-constrained disjunctions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

