lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
Date Tue, 17 Jul 2012 15:32:35 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416290#comment-13416290 ]

Michael McCandless commented on LUCENE-4069:
--------------------------------------------

{quote}
bq. At a minimum I think before committing we should make the SegmentWriteState accessible.

OK. Will that be the subject of a new Jira?
{quote}

No, I mean we shouldn't commit this patch until SegmentWriteState is
accessible when creating the FuzzySet.  I think we can just pass it to
BloomFilterFactory.getSetForField?  This way if the app knows it's a
PK field then it can use maxDoc to always size an appropriate
bit set up front.
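
As a rough illustration of sizing from maxDoc: for a primary-key field the
number of unique terms is at most maxDoc, so the standard Bloom filter
formula m = -n ln(p) / (ln 2)^2 gives a bit count up front.  The class and
method names below are hypothetical and not part of the patch; the
power-of-two rounding is an extra assumption, made only so a hash can be
masked rather than reduced modulo the size.

```java
/** Hypothetical helper, not part of the LUCENE-4069 patch. */
public final class BloomSizing {
    private BloomSizing() {}

    /** Bits needed for the expected number of keys at a target false-positive rate. */
    public static long optimalBits(int expectedKeys, double targetFpp) {
        if (expectedKeys <= 0) {
            throw new IllegalArgumentException("expectedKeys must be positive");
        }
        if (!(targetFpp > 0.0 && targetFpp < 1.0)) {
            throw new IllegalArgumentException("targetFpp must be in (0, 1)");
        }
        double ln2 = Math.log(2.0);
        // m = -n * ln(p) / (ln 2)^2
        return (long) Math.ceil(-expectedKeys * Math.log(targetFpp) / (ln2 * ln2));
    }

    /** Rounds up to the next power of two (assumed convenience, see lead-in). */
    public static long roundUpToPowerOfTwo(long bits) {
        long highest = Long.highestOneBit(bits);
        return highest == bits ? bits : highest << 1;
    }

    /** For a PK field, every document carries one unique term, so n is bounded by maxDoc. */
    public static long bitsForPrimaryKeyField(int maxDoc, double targetFpp) {
        return roundUpToPowerOfTwo(optimalBits(maxDoc, targetFpp));
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000; // would come from the segment being written
        long bits = bitsForPrimaryKeyField(maxDoc, 0.01);
        System.out.println("maxDoc=" + maxDoc + " -> " + bits + " bits (" + (bits / 8) + " bytes)");
    }
}
```

With the segment's maxDoc in hand, a factory could pick a size per segment
instead of relying on one fixed default.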

bq. I think we are in agreement on the broad principles. The fundamental question here though
is do you want to treat an index's choice of Hash algo as something that would require a new
SPI-registered PostingsFormat to decode or can that be handled as I have done here with a
general purpose SPI framework for hashing algos?

+1, that's exactly the question.

Ie, where to draw the line between "config of an existing PF" and
"different PF".

But I guess swapping in a different hash impl should be seen as a simple
config change, so I think using SPI to find it at read time is OK.
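
A minimal sketch of what a read-time lookup by name could look like with
the JDK's java.util.ServiceLoader.  The NamedHash interface and the
HashLookup class are hypothetical stand-ins, not the names used in the
patch; the assumption is that the index records only the hash's name and
the reader resolves it from whatever implementations are registered on the
classpath.

```java
import java.util.ServiceLoader;

/** Hypothetical sketch of SPI-based hash lookup; names are not from the patch. */
public final class HashLookup {
    private HashLookup() {}

    /** Hypothetical SPI: implementations are listed under META-INF/services. */
    public interface NamedHash {
        /** Name written into the index so a reader can find the same impl. */
        String name();

        int hash(byte[] bytes);
    }

    /** Resolves the hash recorded in the index, failing loudly if it is missing. */
    public static NamedHash forName(String name) {
        for (NamedHash candidate : ServiceLoader.load(NamedHash.class)) {
            if (candidate.name().equals(name)) {
                return candidate;
            }
        }
        throw new IllegalArgumentException(
            "No hash implementation registered under name '" + name + "'");
    }
}
```

Failing loudly on an unknown name matters here: decoding a filter with the
wrong hash would silently produce false negatives.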

I still don't like how trappy this approach is: the hardwired default
(8 MB) can be way too big (it silently slows down your NRT reopens,
especially if you bloom all fields) or way too small (it silently turns
off the bloom filter for fields that have too many unique terms).
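
To make the "too small" case concrete: after inserting n unique terms with
k hash functions into m bits, the expected fraction of set bits is about
1 - exp(-k*n/m).  The sketch below uses hypothetical names and assumes one
hash function and a fixed set of 2^26 bits (8 MB if read as bytes, which
is an assumption); it is not the patch's actual saturation check.

```java
/** Hypothetical estimate of Bloom filter saturation; not code from the patch. */
public final class BloomSaturation {
    private BloomSaturation() {}

    /** Expected fraction of bits set: 1 - exp(-k * n / m). */
    public static double expectedSaturation(long numBits, long uniqueTerms, int numHashes) {
        return 1.0 - Math.exp(-(double) numHashes * uniqueTerms / numBits);
    }

    /** Approximate false-positive rate: saturation raised to the number of hashes. */
    public static double falsePositiveRate(long numBits, long uniqueTerms, int numHashes) {
        return Math.pow(expectedSaturation(numBits, uniqueTerms, numHashes), numHashes);
    }

    public static void main(String[] args) {
        long fixedBits = 1L << 26; // assumed fixed size, see lead-in
        long[] termCounts = {1_000L, 10_000_000L, 500_000_000L};
        for (long terms : termCounts) {
            System.out.printf("terms=%d saturation=%.4f%n",
                terms, expectedSaturation(fixedBits, terms, 1));
        }
    }
}
```

A small field wastes nearly all of a fixed-size set, while a field with
far more unique terms than bits saturates it and the filter stops
rejecting anything.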

I also don't think this PF should be per-field: we have
PerFieldPostingsFormat for that, and if there are limitations in PFPF,
we should address them rather than having to make all future PFs
handle per-field-ness themselves.  This PF should really handle one
field.

But I don't think these issues need to hold up commit (except for
making SegmentWriteState accessible)... we can improve over time.  I
think we may simply want to fold this into the terms dict somehow.

Can you add @lucene.experimental to all the new APIs?

                
> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0-ALPHA
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch,
>                      LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java,
>                      PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order
> to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments
> but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" on the
> end of the name to invoke special indexing/querying capability. Clearly a new Field or schema
> declaration(!) would need adding to APIs to configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat


