cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robert Stupp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-9167) Improve bloom-filter false-positive-ratio
Date Tue, 26 Apr 2016 19:15:12 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258728#comment-15258728
] 

Robert Stupp commented on CASSANDRA-9167:
-----------------------------------------

Rebased and updated the patch to address the comments.
But I just removed the {{1.03}} fudge factor - otherwise, if the buckets-per-element is calculated
from the rounded hash code , it ends up with a too high false-positive-ratio (sometime 20%
and more).

This patch also requires a change to {{KeyCacheCqlTest}}, because it checks the number requests
against the key cache, which this patch reduces.

Also scheduled a couple of cstar tests to check for any notable change. Just [Trades|http://cstar.datastax.com/tests/id/b87cc382-0bc3-11e6-a0f0-0256e416528f]
shows a slight 5% improvement. The other tests [Read/Write|http://cstar.datastax.com/tests/id/e8e12b4e-0bc3-11e6-a0f0-0256e416528f]
and [3 MV|http://cstar.datastax.com/tests/id/0c8b8fe4-0bc4-11e6-a0f0-0256e416528f] show a
very marginal change, that I’d file under normal stress result variance.


> Improve bloom-filter false-positive-ratio
> -----------------------------------------
>
>                 Key: CASSANDRA-9167
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9167
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Robert Stupp
>            Assignee: Robert Stupp
>            Priority: Minor
>              Labels: perfomance
>
> {{org.apache.cassandra.utils.BloomCalculations}} performs some table lookups to calculate
the bloom filter specification (size, # of hashes). Using the exact maths for that computation
brings a better false-positive-ratio (the maths usually returns higher numbers for hash-counts).
> TL;DR increasing the number of hash-rounds brings a nice improvement. Finally it's a
trade-off between CPU and I/O.
> ||false-positive-chance||elements||capacity||hash count new||false-positive-ratio new||hash
count current||false-positive-ratio current||improvement
> |0.1|10000|50048|3|0.0848|3|0.0848|0
> |0.1|100000|500032|3|0.09203|3|0.09203|0
> |0.1|1000000|5000064|3|0.0919|3|0.0919|0
> |0.1|10000000|50000064|3|0.09182|3|0.09182|0
> |0.1|100000000|500000064|3|0.091874|3|0.091874|0
> |0.01|10000|100032|7|0.0092|5|0.0107|0.1630434783
> |0.01|100000|1000064|7|0.00818|5|0.00931|0.1381418093
> |0.01|1000000|10000064|7|0.008072|5|0.009405|0.1651387512
> |0.01|10000000|100000064|7|0.008174|5|0.009375|0.146929288
> |0.01|100000000|1000000064|7|0.008197|5|0.009428|0.150176894
> |0.001|10000|150080|10|0.0008|7|0.001|0.25
> |0.001|100000|1500032|10|0.0006|7|0.00094|0.5666666667
> |0.001|1000000|15000064|10|0.000717|7|0.000991|0.3821478382
> |0.001|10000000|150000064|10|0.000743|7|0.000992|0.33512786
> |0.001|100000000|1500000064|10|0.000741|7|0.001002|0.3522267206
> |0.0001|10000|200064|13|0|10|0.0002|#DIV/0!
> |0.0001|100000|2000064|13|0.00004|10|0.0001|1.5
> |0.0001|1000000|20000064|13|0.000075|10|0.000091|0.2133333333
> |0.0001|10000000|200000064|13|0.000069|10|0.000087|0.2608695652
> |0.0001|100000000|2000000064|13|0.000068|10|0.00009|0.3235294118
> If we decide to allow more hash-rounds, it could be nicely back-ported even to 2.0 without
affecting existing sstables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message