lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stefan Pohl (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-4571) speedup disjunction with minShouldMatch
Date Fri, 22 Mar 2013 09:55:16 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Stefan Pohl updated LUCENE-4571:
--------------------------------

    Attachment: LUCENE-4571.patch

With your nice test, I was quickly able to track it down (a >= instead of a == fixed the
issue). This was also present in the previous implementation, but there it was not wrong.
New patch attached.
There are still some assertions failing, but for now I'd attribute them to problems with the
test (are return statements is missing when actual==null?), because these assertions only
fail within the baseline/expected impl. Lucenebench doesn't complain anymore, even when checking
scores.

Instead of re-implementing all of the previous basic implementation, do you think it would
be possible to rather re-use a normal DisjunctionSumScorer and wrap it with a Scorer that
discards candidates that don't match the mm-constraint? This way we would get correct scoring
for free. E.g.
{noformat}
        public int nextDoc() throws IOException {
          int docid;
          while (true) {
            docid = wrappedScorer.nextDoc();
            if (docid==NO_MORE_DOCS)
              break;
            // count how many (optional) subscorers match on the same docid, and check against
constraint
            int cntOpt = 0;
            for (ChildScorer cs : this.subscorers) {
              if (cs.child.docID() == docid && "SHOULD".equalsIgnoreCase(cs.relationship))
                cntOpt++;
            }
            if (cntOpt>=mm)
              break;
          }          
          return docid;
        }
{noformat}

I ran the perf test locally on the 1M doc index and got quite nice numbers. As we anticipated,
the cost-API is very beneficial and the now added per-candidate early-stopping (less advance()-calls
at the expense of additional if-statements) roughly doubles the speedups :)
{noformat}
                    TaskQPS baseline      StdDevQPS my_modified_version      StdDev      
         Pct diff
     Low4MinShouldMatch0       36.41      (3.1%)       35.18      (5.1%)   -3.4% ( -11% -
   5%)
     Low3MinShouldMatch0       21.31      (6.6%)       21.18      (3.9%)   -0.6% ( -10% -
  10%)
     Low2MinShouldMatch0       14.54      (7.4%)       14.45      (3.3%)   -0.6% ( -10% -
  10%)
     HighMinShouldMatch0        9.85      (9.1%)        9.82      (2.8%)   -0.3% ( -11% -
  12%)
     Low1MinShouldMatch0       11.68      (8.3%)       11.65      (3.1%)   -0.3% ( -10% -
  12%)
     HighMinShouldMatch2       10.12      (9.4%)       10.95      (2.8%)    8.2% (  -3% -
  22%)
     Low1MinShouldMatch2       12.08      (8.6%)       13.60      (3.2%)   12.5% (   0% -
  26%)
     Low2MinShouldMatch2       15.23      (7.8%)       18.22      (4.2%)   19.6% (   7% -
  34%)
     HighMinShouldMatch3       10.33      (9.2%)       13.21      (3.2%)   27.9% (  14% -
  44%)
     Low3MinShouldMatch2       22.72      (7.3%)       30.85      (5.8%)   35.8% (  21% -
  52%)
     Low1MinShouldMatch3       12.49      (8.3%)       17.92      (4.4%)   43.5% (  28% -
  61%)
     HighMinShouldMatch4       10.55      (9.0%)       18.29      (4.8%)   73.3% (  54% -
  95%)
     Low2MinShouldMatch3       16.08      (7.2%)       30.38      (6.8%)   88.9% (  69% -
 110%)
     Low1MinShouldMatch4       12.84      (7.9%)       32.68      (9.6%)  154.6% ( 126% -
 186%)
     Low4MinShouldMatch2       45.39      (3.2%)      222.71     (16.0%)  390.6% ( 360% -
 423%)
     Low3MinShouldMatch3       24.70      (6.2%)      215.56     (29.8%)  772.9% ( 693% -
 862%)
     Low4MinShouldMatch3       45.71      (3.1%)      636.23     (53.4%) 1292.0% (1198% -
1390%)
     Low2MinShouldMatch4       16.48      (7.1%)      257.47     (47.6%) 1462.7% (1314% -
1633%)
     Low4MinShouldMatch4       45.81      (3.3%)      811.82     (60.4%) 1672.1% (1557% -
1794%)
     Low3MinShouldMatch4       24.87      (6.8%)      768.67    (119.4%) 2990.2% (2681% -
3344%)
{noformat}
Note that these numbers are not directly comparable to the others in this ticket, but they
give evidence that the new implementation is never (as far as this one query generalizes)
slower than the previous one.
                
> speedup disjunction with minShouldMatch 
> ----------------------------------------
>
>                 Key: LUCENE-4571
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4571
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 4.1
>            Reporter: Mikhail Khludnev
>         Attachments: LUCENE-4571.patch, LUCENE-4571.patch, LUCENE-4571.patch, LUCENE-4571.patch,
LUCENE-4571.patch, LUCENE-4571.patch
>
>
> even minShouldMatch is supplied to DisjunctionSumScorer it enumerates whole disjunction,
and verifies minShouldMatch condition [on every doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]:
> {code}
>   public int nextDoc() throws IOException {
>     assert doc != NO_MORE_DOCS;
>     while(true) {
>       while (subScorers[0].docID() == doc) {
>         if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
>           heapAdjust(0);
>         } else {
>           heapRemoveRoot();
>           if (numScorers < minimumNrMatchers) {
>             return doc = NO_MORE_DOCS;
>           }
>         }
>       }
>       afterNext();
>       if (nrMatchers >= minimumNrMatchers) {
>         break;
>       }
>     }
>     
>     return doc;
>   }
> {code}
> [~spo] proposes (as well as I get it) to pop nrMatchers-1 scorers from the heap first,
and then push them back advancing behind that top doc. For me the question no.1 is there a
performance test for minShouldMatch constrained disjunction. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message