cassandra-commits mailing list archives

From "Robert Stupp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-8413) Bloom filter false positive ratio is not honoured
Date Wed, 24 Jun 2015 16:16:06 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599653#comment-14599653 ]

Robert Stupp commented on CASSANDRA-8413:
-----------------------------------------

No worries.

Yes, _hasOldBfHashOrder_ sounds better, so I've reversed the meaning of the boolean field during
the rebase. The branch, which is still considered WIP, still has a _BloomFilterTest.testBigBloomFilterFpc_
method that compares the conventional BF, the new BF and Guava's BF implementation WRT FPR.
Cassci is currently vetting it.

> Bloom filter false positive ratio is not honoured
> -------------------------------------------------
>
>                 Key: CASSANDRA-8413
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8413
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benedict
>            Assignee: Robert Stupp
>             Fix For: 3.x
>
>         Attachments: 8413-patch.txt, 8413.hack-3.0.txt, 8413.hack.txt
>
>
> Whilst thinking about CASSANDRA-7438 and hash bits, I realised we have a problem with
> sabotaging our bloom filters when using the murmur3 partitioner. I have performed a very quick
> test to confirm this risk is real.
> Since a typical cluster uses the same murmur3 hash for partitioning as we do for bloom
> filter lookups, and we own a contiguous range, we can guarantee that the top X bits collide
> for all keys on the node. This translates into poor bloom filter distribution. I quickly hacked
> LongBloomFilterTest to simulate the problem, and the result in these tests is _up to_ a doubling
> of the actual false positive ratio. The actual change will depend on the key distribution,
> the number of keys, the false positive ratio, the number of nodes, the token distribution,
> etc. But it seems to be a real problem for non-vnode clusters of at least ~128 nodes in size.
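
The "top X bits collide" claim in the description above can be illustrated with a small sketch. It is not taken from the ticket or from Cassandra's code: the class and method names are hypothetical, and it assumes a signed 64-bit token space split evenly across a power-of-two number of non-vnode nodes. It only counts the leading bits shared by every token a node owns; it does not measure the false positive ratio.

```java
// Illustrative only: names and the even, power-of-two split are assumptions.
final class SharedTokenPrefix {

    // Number of leading bits shared by every value in the inclusive range [lo, hi],
    // where lo <= hi in signed order. Any value between lo and hi carries the
    // same common prefix as the two end points.
    static int sharedLeadingBits(long lo, long hi) {
        return Long.numberOfLeadingZeros(lo ^ hi);
    }

    // Inclusive token range {lo, hi} owned by `node` when the signed 64-bit
    // token space is split into `nodes` equal contiguous ranges.
    static long[] rangeForNode(int node, int nodes) {
        if (nodes < 2 || Integer.bitCount(nodes) != 1) {
            throw new IllegalArgumentException("nodes must be a power of two >= 2");
        }
        if (node < 0 || node >= nodes) {
            throw new IllegalArgumentException("node out of range");
        }
        int bits = Integer.numberOfTrailingZeros(nodes); // log2(nodes)
        long width = 1L << (64 - bits);                  // tokens per node
        // long arithmetic wraps modulo 2^64, which is what we want here.
        long lo = Long.MIN_VALUE + node * width;
        long hi = lo + width - 1;
        return new long[] { lo, hi };
    }

    public static void main(String[] args) {
        long[] range = rangeForNode(0, 128);
        System.out.println("shared leading bits with 128 nodes: "
                + sharedLeadingBits(range[0], range[1]));
    }
}
```

Under these assumptions a 128-node ring leaves each node with tokens that agree on their top 7 bits (log2 of the node count), which is the X referred to in the description.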



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
