lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment
Date Sun, 10 Oct 2010 11:22:30 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2690:
---------------------------------------

    Attachment: LUCENE-2690-hack.patch


I attached a hacked patch... nowhere near committable, various tests
fail, etc... yet I think once we clean it up, the approach is viable.

I started from the patch from about two iterations ago, and then
changed how the MTQ BQ rewrite works: instead of two passes (first to
gather matching terms, second to create weight/scorers & run the BQ),
it now makes a single pass.

In that single pass it records which terms matched which segments, and
creates a TermScorer for each.

After the single pass, once we've summed up the top level docFreq for
all terms, I go back and reset the weights for all the TermScorers,
sumSQ them, normalize, etc., and then create a FakeQuery object whose
only purpose is to remember the per-segment scorers and provide them
once .scorer(...) is called on each segment.
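
To make the flow above concrete, here is a rough, self-contained sketch
of the idea in plain Java. It is not the code in the attached patch and
it uses no Lucene classes: segments are modeled as simple term -> docFreq
maps, SegmentTermScorer and Rewritten are hypothetical stand-ins (the
latter playing the role of the FakeQuery), and the idf-style weight
formula is only an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Toy model of the single-pass rewrite; none of these types are Lucene classes. */
class SinglePassRewriteSketch {

    /** Hypothetical stand-in for a per-segment term scorer. */
    static final class SegmentTermScorer {
        final int segment;
        final String term;
        final int segmentDocFreq;
        double weight; // set after the pass, once top-level docFreq is known

        SegmentTermScorer(int segment, String term, int segmentDocFreq) {
            this.segment = segment;
            this.term = term;
            this.segmentDocFreq = segmentDocFreq;
        }
    }

    /** Hypothetical stand-in for the "FakeQuery": it only remembers per-segment scorers. */
    static final class Rewritten {
        final Map<String, Integer> totalDocFreq = new LinkedHashMap<>();
        final Map<Integer, List<SegmentTermScorer>> bySegment = new LinkedHashMap<>();

        List<SegmentTermScorer> scorer(int segment) {
            List<SegmentTermScorer> scorers = bySegment.get(segment);
            return scorers == null ? Collections.<SegmentTermScorer>emptyList() : scorers;
        }
    }

    static Rewritten rewrite(List<Map<String, Integer>> segments,
                             Predicate<String> matches, int numDocs) {
        Rewritten out = new Rewritten();

        // Single pass: visit each segment's terms once, record which terms matched
        // in which segment, and sum the top-level docFreq as we go.
        for (int seg = 0; seg < segments.size(); seg++) {
            for (Map.Entry<String, Integer> e : segments.get(seg).entrySet()) {
                if (!matches.test(e.getKey())) {
                    continue;
                }
                out.totalDocFreq.merge(e.getKey(), e.getValue(), Integer::sum);
                out.bySegment.computeIfAbsent(seg, k -> new ArrayList<>())
                        .add(new SegmentTermScorer(seg, e.getKey(), e.getValue()));
            }
        }

        // Fix-up after the pass: compute a per-term weight from the summed docFreq,
        // sum the squares, and normalize. The formula is an illustrative assumption.
        Map<String, Double> rawWeight = new LinkedHashMap<>();
        double sumSq = 0.0;
        for (Map.Entry<String, Integer> e : out.totalDocFreq.entrySet()) {
            double w = 1.0 + Math.log(numDocs / (e.getValue() + 1.0));
            rawWeight.put(e.getKey(), w);
            sumSq += w * w;
        }
        double norm = sumSq == 0.0 ? 1.0 : 1.0 / Math.sqrt(sumSq);

        // Reset the weight on every scorer created during the pass.
        for (List<SegmentTermScorer> scorers : out.bySegment.values()) {
            for (SegmentTermScorer s : scorers) {
                s.weight = rawWeight.get(s.term) * norm;
            }
        }
        return out;
    }
}
```

The point of the sketch is only the ordering: scorers are created while
enumerating each segment, and their weights are patched afterwards, so
no segment is ever asked about a term it does not contain.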

The big gain with this approach is you don't waste effort trying to
seek to non-existent terms in the sub readers.  Normally the terms
cache would save you here, but we never cache a miss, so when we
look up such a term again it's always a real (costly) seek.
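
As a toy illustration of that cost (plain Java, not Lucene's actual
terms cache; the class and its seek counter are hypothetical), a cache
that remembers only hits pays for a seek on every repeated lookup of a
missing term:

```java
import java.util.HashSet;
import java.util.Set;

/** Toy cache that remembers hits only; a model for illustration, not Lucene code. */
class HitOnlyCacheSketch {
    private final Set<String> segmentTerms;              // terms present in this segment
    private final Set<String> cachedHits = new HashSet<>();
    int seeks = 0;                                       // counts simulated costly seeks

    HitOnlyCacheSketch(Set<String> segmentTerms) {
        this.segmentTerms = segmentTerms;
    }

    boolean lookup(String term) {
        if (cachedHits.contains(term)) {
            return true;                                 // cached hit: no seek needed
        }
        seeks++;                                         // cache has no answer: real seek
        boolean found = segmentTerms.contains(term);
        if (found) {
            cachedHits.add(term);                        // only hits are remembered
        }
        return found;
    }
}
```

Looking up a present term twice costs one seek in this model, while
looking up an absent term twice costs two, which is the waste a second
pass over the sub readers would incur.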

With this approach we can disable using the terms cache entirely from
MTQ.rewrite, which is great.

I believe the patch works correctly, at least for this test, because
on my 10M wikipedia index it gets the same top N results as clean
trunk.  Here are the perf gains:

||Query||QPS clean||QPS mtqseg||Pct diff||
|state|37.49|37.40|{color:red}-0.2%{color}|
|unit*|11.86|20.23|{color:green}70.5%{color}|
|un*d|13.58|30.85|{color:green}127.2%{color}|
|uni*ed|173.22|535.27|{color:green}209.0%{color}|
|u*d|2.61|9.05|{color:green}247.3%{color}|
|un*ed|33.59|120.32|{color:green}258.1%{color}|
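
For reading the last column: it is presumably the relative change in
QPS against clean trunk. A minimal sketch of that arithmetic follows
(the class and method names are made up for illustration, and values
recomputed from the rounded QPS figures above may differ slightly in
the last digit from the reported ones):

```java
/** Relative QPS change in percent; the helper name is hypothetical. */
class PctDiffSketch {
    static double pctDiff(double baselineQps, double patchedQps) {
        return (patchedQps - baselineQps) / baselineQps * 100.0;
    }

    public static void main(String[] args) {
        // The un*d row from the table: 13.58 QPS on clean trunk, 30.85 QPS patched.
        System.out.println(pctDiff(13.58, 30.85));
    }
}
```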

Note that these gains already include the sizable gains from the
original patch, but the single pass approach makes further great
gains, especially, e.g., on the prefix query.

I don't think we should couple this new patch w/ this issue... this
issue already has awesome gains with a fairly minor change...
I'll open a new issue.


> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690-hack.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional data structures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because its minimum score handling depends on the terms being collected in order.
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

