lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
Date Mon, 18 Jun 2012 11:42:43 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395830#comment-13395830 ]

Michael McCandless commented on LUCENE-4069:
--------------------------------------------

bq. Interesting results, Mike - thanks for taking the time to run them.

You're welcome!

{quote}
bq. BloomFilteredFieldsProducer should just pass through intersect to the delegate?

I have tried to make the BloomFilteredFieldsProducer get out of the way of the client app
and the delegate PostingsFormat as soon as it is safe to do so i.e. when the user is safely
focused on a non-filtered field. While there is a chance the client may end up making a call
to TermsEnum.seekExact(..) on a filtered field then I need to have a wrapper object in place
which is in a position to intercept this call. In all other method invocations I just end
up delegating calls so I wonder if all these extra method calls are the cause of the slowdown
you see e.g. when Fuzzy is enumerating over many terms. 
 The only other alternatives to endlessly wrapping in this way are:
 a) API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out for just this
one method.
 b) Mess around with byte-code manipulation techniques to weave in Bloom filtering (the sort
of thing I recall Hibernate resorts to)

Neither of these seems a particularly appealing option so I think we may have to live with fuzzy+bloom
not being as fast as straight fuzzy.
{quote}

I think the fix is simple: BloomFilteredTerms does not currently
override Terms.intersect.  If you override it and delegate immediately,
FuzzyN/Respell performance should be just as good as with the
Lucene40 codec.
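
To illustrate what I mean, here is a minimal, self-contained sketch.
It does not use the real Lucene classes: SimpleTerms, SortedTerms and
BloomFilteredSimpleTerms are hypothetical stand-ins, seekExact is
folded into the terms class for brevity (in Lucene it lives on
TermsEnum), the matcher is a plain Predicate rather than an automaton,
and the "Bloom filter" is a single-hash BitSet.  It assumes the base
class supplies a generic fallback intersect that scans iterator(), so
a wrapper that does not override intersect never reaches the
delegate's optimized version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Predicate;

// Demo entry point; all names in this file are hypothetical stand-ins.
class IntersectDelegationDemo {
    public static void main(String[] args) {
        SortedTerms base = new SortedTerms(Arrays.asList("lucene", "lucent", "solr"));
        BloomFilteredSimpleTerms filtered = new BloomFilteredSimpleTerms(base, 1024);
        System.out.println("seekExact(lucene): " + filtered.seekExact("lucene"));
        System.out.println("intersect(luc*):   " + filtered.intersect(t -> t.startsWith("luc")));
        System.out.println("optimized intersect calls: " + base.fastIntersects);
    }
}

// Simplified model of a terms dictionary (assumption, not the Lucene API).
abstract class SimpleTerms {
    abstract Iterator<String> iterator();
    abstract boolean seekExact(String term);

    // Generic fallback: walk every term and test it.
    List<String> intersect(Predicate<String> matcher) {
        List<String> hits = new ArrayList<>();
        Iterator<String> it = iterator();
        while (it.hasNext()) {
            String term = it.next();
            if (matcher.test(term)) hits.add(term);
        }
        return hits;
    }
}

// Stand-in for the delegate codec, which has its own intersect.
final class SortedTerms extends SimpleTerms {
    private final TreeSet<String> terms;
    int fastIntersects = 0; // counts calls that reached the codec's own intersect

    SortedTerms(Collection<String> terms) { this.terms = new TreeSet<>(terms); }

    @Override Iterator<String> iterator() { return terms.iterator(); }
    @Override boolean seekExact(String term) { return terms.contains(term); }

    @Override List<String> intersect(Predicate<String> matcher) {
        fastIntersects++; // a real codec would do something smarter than a scan here
        List<String> hits = new ArrayList<>();
        for (String term : terms) if (matcher.test(term)) hits.add(term);
        return hits;
    }
}

// Wrapper that only adds value on seekExact and gets out of the way otherwise.
final class BloomFilteredSimpleTerms extends SimpleTerms {
    private final SimpleTerms delegate;
    private final BitSet bits;
    private final int numBits;

    BloomFilteredSimpleTerms(SimpleTerms delegate, int numBits) {
        this.delegate = delegate;
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
        Iterator<String> it = delegate.iterator();
        while (it.hasNext()) bits.set(slot(it.next()));
    }

    // Single hash function for brevity; a real Bloom filter would use several.
    private int slot(String term) { return Math.floorMod(term.hashCode(), numBits); }

    @Override Iterator<String> iterator() { return delegate.iterator(); }

    @Override boolean seekExact(String term) {
        if (!bits.get(slot(term))) return false; // fast-fail: definitely absent
        return delegate.seekExact(term);         // maybe present: ask the delegate
    }

    // The point of the sketch: override and immediately delegate.
    @Override List<String> intersect(Predicate<String> matcher) {
        return delegate.intersect(matcher);
    }
}
```

If the intersect override in the wrapper is removed, the inherited
fallback runs instead and fastIntersects stays at 0, which is the
situation I suspect FuzzyN/Respell are hitting in the patch.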

bq. For completeness' sake - I don't have access to your benchmarking code

All the benchmarking code is here: http://code.google.com/a/apache-extras.org/p/luceneutil/

I run it nightly (trunk) and publish the results here: http://people.apache.org/~mikemccand/lucenebench/

bq. but I would hope that PostingsFormat.fieldsProducer() isn't called more than once for
the same segment as that's where the Bloom filters get loaded from disk so there's inherent
cost there too. I can't imagine this is the case.

It's only called once on init'ing the SegmentReader (or at least it
better be!).

{quote}
BTW I've just finished a long-running set of tests which mixes up reads and writes here: http://goo.gl/KJmGv
 This benchmark represents how graph databases such as Neo4j use Lucene for an index when
loading (I typically use the Wikipedia links as a test set). I look to get a 3.5 x speed up
in Lucene 4 and Lucene 3.6 gets nearly 9 x speedup over the comparatively slower 3.6 codebase.
{quote}

Nice results!  It looks like bloom(3.6) is faster than bloom(4.0)?
Why is that?

Also, I wonder why you see such sizable gains (3.5X speedup) on PK
lookup when my benchmark shows only ~13% - 24%.  My index has 5
segments per level...

                
> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0, 3.6.1
>
>         Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch,
>                      PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order
> to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments
> but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" on the
> end of the name to invoke special indexing/querying capability. Clearly a new Field or schema
> declaration(!) would need adding to APIs to configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
