cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sandeep Tata (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CASSANDRA-68) Bloom filters have much higher false-positive rate than expected
Date Sat, 11 Apr 2009 23:31:14 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698145#action_12698145
] 

Sandeep Tata commented on CASSANDRA-68:
---------------------------------------

I ran some more tests, here's what I found:

Old code:
Test                                             MAX_FP Actual FP
FalsePositivesInt/Random     0.1            0.142
FalsePositivesInt/Random     0.01         (pass)
FalsePositivesInt/Random     0.001       (pass)
Words                                         0.1            0.15
Words                                         0.01          (pass)
Words                                         0.001        0.0013

New code:

FalsePositivesInt/Random     0.1           (pass)
FalsePositivesInt/Random     0.01         (pass)
FalsePositivesInt/Random     0.001       (pass)
Words                                         0.1            (pass)
Words                                         0.01          (pass)
Words                                         0.001        (pass)

The old bloomfilter certainly reports up to 50% more than expected false positive rates for
some cases. The new bloomfilter is more predictable, it always passes.

On my machine, some quick-n-crude tests show that the new bloom-filter is about 4x slower.
(I tested at FP rate = 0.01) . When you take into account the fact that the penalty for a
false positive at least 3 orders of magnitude more expensive than the actual hash calculation
(an FP usually means you'll end up hitting disk unnecessarily), it makes sense to use it even
when you set the FP rate to 0.001. It is even more useful at higher rates.

> Bloom filters have much higher false-positive rate than expected
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-68
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-68
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-r-m-unused-code-including-entire-CountingBloomFilte.patch,
0002-replace-JenkinsHash-w-MurmurHash.-its-hash-distrib.patch, 0003-rename-BloomFilter.fill-add.patch,
0004-rewrite-bloom-filters-to-use-murmur-hash-and-combina.patch, 0004-v3.patch, 0004a-tests.patch,
0004b-code.patch, 0005-switch-back-to-old-hash-generation-code-to-demonstra.patch, fp_test_for_old_code.patch,
fp_test_for_old_code_v2.patch, words.gz
>
>
> Gory details: http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message