Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Fri, 15 Jun 2012 16:24:42 +0000 (UTC)
From: "Michael McCandless (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: <234032590.19172.1339777482828.JavaMail.jiratomcat@issues-vm>
In-Reply-To: 
 <779773358.14423.1337355913561.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a
 2 x speed up on rare term searches
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295748#comment-13295748 ] 

Michael McCandless commented on LUCENE-4069:
--------------------------------------------

I ran a benchmark on a 10M Wikipedia index. For the factory I used createSetBasedOnMemory and passed it 100 MB; I think that's enough to ensure we get the 10% saturation on save:

{noformat}
                Task    QPS base StdDev base   QPS bloom StdDev bloom      Pct diff
              Fuzzy1      102.47        3.67       41.95        0.78  -61% -  -56%
              Fuzzy2       38.36        1.76       18.68        0.37  -54% -  -47%
             Respell       89.89        4.38       44.09        0.52  -53% -  -47%
            Wildcard       40.48        2.82       36.20        0.64  -17% -   -2%
        SloppyPhrase        7.96        0.28        8.07        0.07   -3% -    5%
             Prefix3       61.94        5.34       63.35        0.37   -6% -   12%
        TermBGroup1M       71.37        6.79       73.73        1.55   -7% -   16%
          AndHighMed       64.09        5.51       66.73        1.75   -6% -   16%
      TermBGroup1M1P       49.55        3.78       51.75        2.67   -7% -   18%
         AndHighHigh       16.05        1.12       16.77        0.53   -5% -   15%
         TermGroup1M       35.87        3.07       37.56        0.74   -5% -   16%
          OrHighHigh        9.60        1.38       10.15        0.65  -13% -   31%
           OrHighMed       11.93        1.91       12.63        0.93  -15% -   35%
              IntNRQ        9.12        1.25        9.68        0.11   -7% -   24%
                Term      154.55       19.60      165.32        0.97   -5% -   23%
              Phrase       11.40        0.33       12.21        0.18    2% -   11%
            SpanNear        4.31        0.07        4.73        0.03    7% -   12%
            PKLookup      122.78        1.42      145.95        5.22   13% -   24%
{noformat}
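
A note on reading the Pct diff column: the ranges appear consistent with comparing the pessimistic case (bloom QPS minus its StdDev against base QPS plus its StdDev) and the optimistic case (the reverse), truncated to whole percent. This is inferred from the numbers in the table, not taken from the benchmark tool, so treat it as an assumption. The class and method names below are made up for illustration. A minimal sketch:

```java
/** Sketch only: recomputes a "Pct diff" range from the QPS and StdDev columns. */
final class PctDiffRange {

    /** Pessimistic bound: slowest plausible candidate vs. fastest plausible baseline. */
    static int low(double qpsBase, double sdBase, double qpsCmp, double sdCmp) {
        double ratio = (qpsCmp - sdCmp) / (qpsBase + sdBase);
        return (int) (100.0 * (ratio - 1.0)); // cast truncates toward zero
    }

    /** Optimistic bound: fastest plausible candidate vs. slowest plausible baseline. */
    static int high(double qpsBase, double sdBase, double qpsCmp, double sdCmp) {
        double ratio = (qpsCmp + sdCmp) / (qpsBase - sdBase);
        return (int) (100.0 * (ratio - 1.0)); // cast truncates toward zero
    }

    public static void main(String[] args) {
        // Values copied from the PKLookup and Fuzzy1 rows of the table.
        System.out.println("PKLookup: " + low(122.78, 1.42, 145.95, 5.22)
                + "% - " + high(122.78, 1.42, 145.95, 5.22) + "%"); // PKLookup: 13% - 24%
        System.out.println("Fuzzy1: " + low(102.47, 3.67, 41.95, 0.78)
                + "% - " + high(102.47, 3.67, 41.95, 0.78) + "%");  // Fuzzy1: -61% - -56%
    }
}
```

Because the printed QPS and StdDev values are themselves rounded, recomputing from the table can land one point away from the printed range on some rows (the lower bound of SloppyPhrase, for example).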

The baseline uses the Lucene40 PostingsFormat even for the id field, so PKLookup gets a good improvement.  This is on an index with 5 segments at each level.

Other queries seem to speed up as well (e.g. Term, Or*), though their ranges still include zero.

The queries that rely on Terms.intersect got much worse (see Fuzzy1, Fuzzy2, Respell and Wildcard above): should BloomFilteredFieldsProducer just pass intersect through to the delegate?
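
To make the pass-through idea concrete, here is a minimal, self-contained sketch. It does not use Lucene's classes: TermSource, SimpleTermSource and BloomFilteredTermSource are hypothetical stand-ins, the two-hash BitSet filter is a toy, and nothing here reflects how the attached patch is actually written. It only illustrates the shape of the suggestion: exact lookups consult the filter first, while term enumeration goes straight to the wrapped dictionary.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Predicate;

/** Hypothetical stand-in for a per-segment term dictionary (not a Lucene type). */
interface TermSource {
    boolean seekExact(String term);
    List<String> intersect(Predicate<String> matcher);
}

/** Plain sorted-set dictionary that counts how many exact lookups reach it. */
final class SimpleTermSource implements TermSource {
    private final TreeSet<String> terms = new TreeSet<>();
    int exactLookups = 0;

    SimpleTermSource(List<String> input) {
        terms.addAll(input);
    }

    @Override
    public boolean seekExact(String term) {
        exactLookups++;
        return terms.contains(term);
    }

    @Override
    public List<String> intersect(Predicate<String> matcher) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (matcher.test(t)) {
                out.add(t);
            }
        }
        return out;
    }
}

/** Wrapper: the filter guards exact lookups only; enumeration is delegated untouched. */
final class BloomFilteredTermSource implements TermSource {
    private static final int NUM_BITS = 1 << 16; // arbitrary size for the sketch
    private final BitSet bits = new BitSet(NUM_BITS);
    private final TermSource delegate;

    BloomFilteredTermSource(TermSource delegate, List<String> indexedTerms) {
        this.delegate = delegate;
        for (String t : indexedTerms) {
            bits.set(slot(t, 0));
            bits.set(slot(t, 1));
        }
    }

    /** Toy hash: mixes String.hashCode with a seed; not the hash used by the patch. */
    private static int slot(String term, int seed) {
        int h = term.hashCode() * 31 + seed * 0x9E3779B9;
        h ^= (h >>> 16);
        return Math.floorMod(h, NUM_BITS);
    }

    @Override
    public boolean seekExact(String term) {
        if (!bits.get(slot(term, 0)) || !bits.get(slot(term, 1))) {
            return false; // definitely absent: skip the delegate entirely
        }
        return delegate.seekExact(term); // possibly present: confirm with the delegate
    }

    @Override
    public List<String> intersect(Predicate<String> matcher) {
        // A membership filter answers "is this one term present?", which enumeration
        // never asks, so the call is handed to the delegate unchanged.
        return delegate.intersect(matcher);
    }
}
```

With this shape, only single-term lookups pay for (and benefit from) the filter check; anything that walks the term dictionary sees exactly what the delegate would return on its own.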
                
> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0, 3.6.1
>
>         Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat
