incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Prashant Malik <pma...@gmail.com>
Subject Re: [jira] Commented: (CASSANDRA-68) Bloom filters have much higher false-positive rate than expected
Date Sun, 12 Apr 2009 02:53:22 GMT
The results are a bit counter intuitive here I would have expected it to be
faster with the same FP rate but   I am not sure why it is slower if you are
just using a couple of hash functions and using double hashing.

We had done these experiments a while back and found results matching with
the theoretical tables.

I am sorry I haven't looked at the test code but have you tried it with
large strings as keys ? e.g 128 byte keys , also with Longs.

But this is an interesting exercise for sure :).

Thanks
Prashant

On Sat, Apr 11, 2009 at 4:31 PM, Sandeep Tata (JIRA) <jira@apache.org>wrote:

>
>    [
> https://issues.apache.org/jira/browse/CASSANDRA-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698145#action_12698145]
>
> Sandeep Tata commented on CASSANDRA-68:
> ---------------------------------------
>
> I ran some more tests, here's what I found:
>
> Old code:
> Test                                             MAX_FP Actual FP
> FalsePositivesInt/Random     0.1            0.142
> FalsePositivesInt/Random     0.01         (pass)
> FalsePositivesInt/Random     0.001       (pass)
> Words                                         0.1            0.15
> Words                                         0.01          (pass)
> Words                                         0.001        0.0013
>
> New code:
>
> FalsePositivesInt/Random     0.1           (pass)
> FalsePositivesInt/Random     0.01         (pass)
> FalsePositivesInt/Random     0.001       (pass)
> Words                                         0.1            (pass)
> Words                                         0.01          (pass)
> Words                                         0.001        (pass)
>
> The old bloomfilter certainly reports up to 50% more than expected false
> positive rates for some cases. The new bloomfilter is more predictable, it
> always passes.
>
> On my machine, some quick-n-crude tests show that the new bloom-filter is
> about 4x slower. (I tested at FP rate = 0.01) . When you take into account
> the fact that the penalty for a false positive at least 3 orders of
> magnitude more expensive than the actual hash calculation (an FP usually
> means you'll end up hitting disk unnecessarily), it makes sense to use it
> even when you set the FP rate to 0.001. It is even more useful at higher
> rates.
>
> > Bloom filters have much higher false-positive rate than expected
> > ----------------------------------------------------------------
> >
> >                 Key: CASSANDRA-68
> >                 URL: https://issues.apache.org/jira/browse/CASSANDRA-68
> >             Project: Cassandra
> >          Issue Type: Bug
> >            Reporter: Jonathan Ellis
> >            Assignee: Jonathan Ellis
> >             Fix For: 0.3
> >
> >         Attachments:
> 0001-r-m-unused-code-including-entire-CountingBloomFilte.patch,
> 0002-replace-JenkinsHash-w-MurmurHash.-its-hash-distrib.patch,
> 0003-rename-BloomFilter.fill-add.patch,
> 0004-rewrite-bloom-filters-to-use-murmur-hash-and-combina.patch,
> 0004-v3.patch, 0004a-tests.patch, 0004b-code.patch,
> 0005-switch-back-to-old-hash-generation-code-to-demonstra.patch,
> fp_test_for_old_code.patch, fp_test_for_old_code_v2.patch, words.gz
> >
> >
> > Gory details:
> http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message