cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9167) Improve bloom-filter false-positive-ratio
Date Tue, 02 Feb 2016 12:22:39 GMT


Benedict commented on CASSANDRA-9167:

[~snazy]: sorry for completely letting this fester.  It looks good to me, the only question
is if we _want_ to trade CPU and memory-indirection costs.  In principle we do, since this
is to prevent disk accesses.  However for many of our benchmark in-memory workloads this is
probably only increasing our costs.  However the cost increase is likely minimal compared
to our other costs, so on balance with the present state of affairs I'd say this is a pretty
darn positive change.

That said, I think it would be neater if we could permit a degree of configurability, where
we weigh the marginal cost increase versus the marginal false positive gain for any added
hash function, and provide users the option of configuring the tradeoff between the two.

That also said, it's perhaps overthinking things for now, and this positive if unglamorous
change should have been included a long time ago IMO.

Assuming it goes in roughly as-is, a few nits/suggestions:

# Add a 'd' suffix to the literals
# Re-calculate the buckets per-element after rounding the hash count, instead of adding a
fudge-factor (after all, half the time that fudge factor is wasting space, the other half
it may be insufficient)?

> Improve bloom-filter false-positive-ratio
> -----------------------------------------
>                 Key: CASSANDRA-9167
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Robert Stupp
>            Assignee: Robert Stupp
>            Priority: Minor
>              Labels: perfomance
> {{org.apache.cassandra.utils.BloomCalculations}} performs some table lookups to calculate
the bloom filter specification (size, # of hashes). Using the exact maths for that computation
brings a better false-positive-ratio (the maths usually returns higher numbers for hash-counts).
> TL;DR increasing the number of hash-rounds brings a nice improvement. Finally it's a
trade-off between CPU and I/O.
> ||false-positive-chance||elements||capacity||hash count new||false-positive-ratio new||hash
count current||false-positive-ratio current||improvement
> |0.1|10000|50048|3|0.0848|3|0.0848|0
> |0.1|100000|500032|3|0.09203|3|0.09203|0
> |0.1|1000000|5000064|3|0.0919|3|0.0919|0
> |0.1|10000000|50000064|3|0.09182|3|0.09182|0
> |0.1|100000000|500000064|3|0.091874|3|0.091874|0
> |0.01|10000|100032|7|0.0092|5|0.0107|0.1630434783
> |0.01|100000|1000064|7|0.00818|5|0.00931|0.1381418093
> |0.01|1000000|10000064|7|0.008072|5|0.009405|0.1651387512
> |0.01|10000000|100000064|7|0.008174|5|0.009375|0.146929288
> |0.01|100000000|1000000064|7|0.008197|5|0.009428|0.150176894
> |0.001|10000|150080|10|0.0008|7|0.001|0.25
> |0.001|100000|1500032|10|0.0006|7|0.00094|0.5666666667
> |0.001|1000000|15000064|10|0.000717|7|0.000991|0.3821478382
> |0.001|10000000|150000064|10|0.000743|7|0.000992|0.33512786
> |0.001|100000000|1500000064|10|0.000741|7|0.001002|0.3522267206
> |0.0001|10000|200064|13|0|10|0.0002|#DIV/0!
> |0.0001|100000|2000064|13|0.00004|10|0.0001|1.5
> |0.0001|1000000|20000064|13|0.000075|10|0.000091|0.2133333333
> |0.0001|10000000|200000064|13|0.000069|10|0.000087|0.2608695652
> |0.0001|100000000|2000000064|13|0.000068|10|0.00009|0.3235294118
> If we decide to allow more hash-rounds, it could be nicely back-ported even to 2.0 without
affecting existing sstables.

This message was sent by Atlassian JIRA

View raw message