cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6633) Dynamic Resize of Bloom Filters
Date Wed, 04 Feb 2015 14:26:36 GMT


Benedict commented on CASSANDRA-6633:

bq. I don't see how you can add separate spaces to that and still have a BF.

What gives you the impression bloom filters have to be stored in a uniform address space?
That is variant dependent. It marginally improves the false positive rate, however it is very
marginal, especially as the size grows. The advantage is we can dynamically change our false
positive ratio, especially under detection of severely underfulfilling our estimated ratio
(in the presence of commonly looked up records hitting it). Having completely distinct address
spaces for every hash function isn't requisite, though. We can have varying combinations of
overlapping functions in their own isolated uniform address spaces, so that we can mix and
match to get optimal space and false positive efficiency.

For reference, see the paper that the current variant we use is based upon: []
where it addresses the theoretical false positive ratio for "partitioned" bloom filters -
which is identical as the set grows to infinity - and the false positive ratios for various
combinations of bit and hash function counts. The effect is generally within 1% of the desired
false positive, but for very small sets and many hash functions, it can be larger, say 10%
(of the false positive). For such sets we can choose not to use this method, since they will
not occupy much memory anyway.

> Dynamic Resize of Bloom Filters
> -------------------------------
>                 Key: CASSANDRA-6633
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Priority: Minor
>              Labels: performance
>             Fix For: 3.0
> Dynamic resizing would be useful. The simplest way to achieve this is to have separate
address spaces for each hash function, so that we may increase/decrease accuracy by simply
loading/unloading another function (we could even do interesting stuff in future like alternating
the functions we select if we find we're getting more false positives than should be expected);
> Faster loading/unloading would help this, and we could achieve this by mmapping the bloom
filter representation on systems that we can mlock.

This message was sent by Atlassian JIRA

View raw message