incubator-cassandra-user mailing list archives

From Daniel Doubleday <daniel.double...@gmx.net>
Subject Re: High BloomFilterFalseRation
Date Tue, 02 Nov 2010 08:28:49 GMT
Hi all,

I had some time yesterday to dig a little deeper, and maybe this will save some time for
someone who made the same mistake.

I tried to reproduce the problem in unit tests with the same data. That led nowhere, because
every single result was almost exactly what the math promised. Along the way I stumbled upon
this one: http://sites.google.com/site/murmurhash/murmurhash2flaw and thought all was lost,
but I finally found that everything is just fine.

It turns out that the JMX BloomFilterFalseRatio simply does not show what I expected.
I thought it would be a quality measure of how well the bloom filter works in terms of hit
rate, which would be (Unnecessary File Lookups / Total Lookups). Instead it is
(False Positives / (False Positives + True Positives)), which means it does not count the
lookups that were rejected by the filter.

So if you only ask for rows that do not exist, this ratio will always show 1.0.

Meaning it is rather a measure of how many of your queries ask for nonexistent values.
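
To make the difference concrete, here is a minimal sketch of the two ratios described above. The function names and the workload numbers are hypothetical and chosen only for illustration; this is not Cassandra's code, just the two formulas from this message written out.

```python
def positives_only_ratio(false_pos, true_pos):
    """False Positives / (False + True Positives), the ratio described above."""
    total = false_pos + true_pos
    return false_pos / total if total else 0.0


def unnecessary_lookup_ratio(false_pos, true_pos, true_neg):
    """Unnecessary File Lookups / Total Lookups, the ratio originally expected."""
    total = false_pos + true_pos + true_neg
    return false_pos / total if total else 0.0


# Hypothetical workload: 10,000 reads, all for rows that do not exist.
# The filter correctly rejects 9,990 of them and lets 10 through.
false_pos, true_pos, true_neg = 10, 0, 9990

jmx_style = positives_only_ratio(false_pos, true_pos)
lookup_style = unnecessary_lookup_ratio(false_pos, true_pos, true_neg)

print(f"positives-only ratio:     {jmx_style}")
print(f"unnecessary lookup ratio: {lookup_style}")
```

With these assumed counts the filter rejects almost every read, yet the positives-only ratio is at its maximum because there are no true positives to dilute it.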

Cheers,
Daniel
 

On Oct 28, 2010, at 1:10 PM, Daniel Doubleday wrote:

> Hi Ryan
> 
> I took a sample of one sstable (just flushed, not compacted). 
> 
> I compared samples of 2 sstables: one that shows fine false positive ratios and the problem one.
> And yes, both look the same to me. Both have the expected 15 buckets per row and the cardinality of the bitsets is the same.
> 
> But I am pretty sure that it is indeed, as suggested, a problem with a skewed query pattern. I stopped the import and started a random read test and things look better.
> 
> I'll try to reproduce this with a patched cassandra to get more debug info and figure out why this is happening, because I still don't understand it.
> 
> Thanks for your time everyone
> 
> == Sample of problem CF ==
> 
> DATA FILE
> 
> file size: 68804626 bytes
> rows: 7432 
> 
> FILTER FILE
> 
> file size: 14013 bytes
> bloom filter bitset size: 111488
> bloom filter bitset cardinality: 54062
> 
> 
> == Sample of working CF ==
> 
> DATA FILE
> 
> file size: 110730565 bytes
> rows: 47432
> 
> FILTER FILE
> 
> file size: 96565 bytes
> bloom filter bitset size: 771904
> bloom filter bitset cardinality: 354610
> 
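
As a rough cross-check of the two samples above, the fraction of set bits in each filter can be computed from the reported bitset size and cardinality. The sketch below also applies the textbook approximation that a lookup for an absent key passes when all k probed bits are set, i.e. roughly fill^k. The number of hash functions is not given in this thread, so ASSUMED_HASH_COUNT is a hypothetical value, and the resulting rates are illustrative only.

```python
# (bitset size, bitset cardinality) as reported in the samples above.
samples = {
    "problem CF": (111488, 54062),
    "working CF": (771904, 354610),
}

# Assumption: the hash count is not stated in the thread; 10 is a placeholder.
ASSUMED_HASH_COUNT = 10


def fill_fraction(bitset_size, cardinality):
    """Fraction of bits in the filter that are set."""
    return cardinality / bitset_size


def estimated_false_positive_rate(fill, hash_count):
    """Textbook estimate: chance that all probed bits are set for an absent key."""
    return fill ** hash_count


fills = {name: fill_fraction(size, card) for name, (size, card) in samples.items()}
rates = {name: estimated_false_positive_rate(f, ASSUMED_HASH_COUNT)
         for name, f in fills.items()}

for name in samples:
    print(f"{name}: fill={fills[name]:.3f}, estimated FP rate={rates[name]:.5f}")
```

Both filters come out a little under half full, which fits the observation that the two samples look the same.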
> 
> On Oct 27, 2010, at 6:41 PM, Ryan King wrote:
> 
>> On Wed, Oct 27, 2010 at 3:24 AM, Daniel Doubleday
>> <daniel.doubleday@gmx.net> wrote:
>>> Hi people
>>> 
>>> We are currently moving our second use case from mysql to cassandra. While importing the data (ongoing) I noticed that the BloomFilterFalseRatio seems to be pretty high compared to another CF which is in use in production right now.
>>> 
>>> It's a hierarchical data model and I cannot avoid doing a read before inserting multiple columns.
>>> 
>>> I see a false positive ratio of 0.28 while in my other CF it is 0.00025.
>>> 
>>> The CF had 5 live sstables while I read that ratio. At that time I had inserted ~200k rows with a total of 1M cols. Row keys are pretty large unfortunately (key.length() ~ 60).
>>> 
>>> Just wanted to check if this value is to be expected.
>> 
>> This is not expected. How big are the bloom filters on disk?
>> 
>> -ryan
> 

