lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment
Date Sat, 09 Oct 2010 18:44:30 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919528#action_12919528 ]

Michael McCandless commented on LUCENE-2690:
--------------------------------------------

We have to sort the terms coming out of the BytesRefHash; otherwise we get bad seek performance,
because the within-block seek optimization will often fail to apply...

So I used a TreeMap instead of HashMap.
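
To illustrate why the map type matters here, below is a minimal sketch, not the code in the patch. It assumes each segment contributes a plain term-to-docFreq map; the class and method names (SortedTermCollector, collect, seekOrder) are hypothetical and are not Lucene APIs. It also uses String keys, whose natural order matches the term dictionary's byte order only for simple (e.g. ASCII) terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; none of these names are Lucene APIs.
final class SortedTermCollector {

    // Merges per-segment (term -> docFreq) maps. The TreeMap removes duplicate
    // terms across segments and keeps its keys sorted, which a HashMap would not.
    // Summing docFreq on a duplicate is an assumption made for this sketch.
    static TreeMap<String, Integer> collect(List<Map<String, Integer>> segments) {
        TreeMap<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> segment : segments) {
            for (Map.Entry<String, Integer> e : segment.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return merged;
    }

    // Returns the merged terms in ascending order, i.e. the order in which a
    // later loop would seek to them in the term dictionary.
    static List<String> seekOrder(List<Map<String, Integer>> segments) {
        return new ArrayList<>(collect(segments).keySet());
    }
}
```

Because the terms come back in ascending order, consecutive seeks move forward through the term dictionary and are more likely to land in the block already being read, which is the case the within-block seek optimization targets.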

Then I ran a quick perf test on a 10M Wikipedia index:

||Query||QPS clean||QPS mtqseg||Pct diff||
|unit*|11.83|11.80|{color:red}-0.3%{color}|
|un*d|13.64|16.95|{color:green}24.3%{color}|
|u*d|2.67|3.77|{color:green}41.1%{color}|
|un*ed|34.85|74.94|{color:green}115.0%{color}|
|uni*ed|183.37|437.13|{color:green}138.4%{color}|

So these are good gains!  I can't run FuzzyQuery until we fix the tie-break problem...

I'm really not sure why the prefix query sees no gain yet the others do (I would have actually
expected the reverse, because PrefixTermsEnum's accept method is so simple).

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional data structures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because its minimum-score handling depends on the terms being collected in order.
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

