cassandra-user mailing list archives

From Daniel Doubleday <daniel.double...@gmx.net>
Subject Re: High BloomFilterFalseRation
Date Wed, 27 Oct 2010 15:46:53 GMT
The key might be there if a prior insert forced the generation of the parent path.

The probability of misses is hard to estimate. Zabbix reports about 5 reads and 100 writes
per second. Given the layout of the tree we are importing, I guess the probability that the
queried path did not exist is pretty high, probably > 0.5.
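
A minimal sketch of the "mkdir -p" style parent check described in the quoted message below, to make clear which reads are issued per insert. The function names and the `exists` callable are hypothetical stand-ins for the actual read against the CF, not code from the importer.

```python
def parent_paths(path, sep="/"):
    """Return the ancestors of `path`, nearest parent first.

    Example: 'a/b/foo' yields ['a/b', 'a'].
    """
    parts = path.split(sep)
    return [sep.join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def missing_parents(path, exists):
    """Walk upward until an existing ancestor is found.

    `exists` is a hypothetical callable standing in for the read that
    checks whether a row key is present. Every ancestor that is missing
    costs one read that must be answered by the bloom filters.
    """
    missing = []
    for parent in parent_paths(path):
        if exists(parent):
            break
        missing.append(parent)
    return missing
```

Each entry returned by `missing_parents` corresponds to a read for a key that is not there, which is where the high share of misses comes from.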

Datamodel

One Standard CF: BytesType
It resembles a file system.

Key: path

Columns:
0x00: byte[] # data of the file (1k - 10k). Might not exist if the path contains no data (directory)
0x01: metadata : protobuf (< 1k bytes)
concat(0x7F, <latin1 bytes of child name>): copy of the child's metadata # that's for listings and bulk gets of the children's metadata
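
The column layout above can be sketched as follows. This is only an illustration of the naming scheme; the constant and function names are hypothetical, and the metadata values are treated as opaque bytes rather than real protobuf messages.

```python
DATA_COLUMN = b"\x00"      # file contents; absent for directories
METADATA_COLUMN = b"\x01"  # serialized metadata of the path itself
CHILD_PREFIX = b"\x7f"     # prefix for the per-child metadata copies


def child_column(child_name):
    """Column name holding the copy of one child's metadata."""
    return CHILD_PREFIX + child_name.encode("latin-1")


def directory_row(metadata, children):
    """Build the columns of a directory row.

    `children` maps child name -> serialized child metadata (bytes).
    No DATA_COLUMN is written because a directory carries no data.
    """
    row = {METADATA_COLUMN: metadata}
    for name, child_meta in children.items():
        row[child_column(name)] = child_meta
    return row
```

Because column names are compared as raw bytes, the two fixed columns sort before all 0x7F-prefixed child columns, so a listing can be read as one contiguous slice starting at 0x7F.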

I tried to reproduce this with 50k real paths and the Cassandra FilterTest, but to no avail:
I always get 0 false positives.
There's no way that murmur has a weakness with common prefixes, right?
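
For anyone who wants to repeat this kind of check outside of Cassandra, here is a rough, self-contained sketch. It is not Cassandra's filter: it swaps in MD5-based double hashing for murmur, and the sizing (15 bits per key, 10 hash functions) and the key patterns are arbitrary assumptions chosen only to produce keys with long common prefixes.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter using double hashing derived from one MD5 digest."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.md5(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def false_positive_rate(present, absent, bits_per_key=15, num_hashes=10):
    """Insert `present`, then count how many `absent` keys the filter accepts."""
    bf = BloomFilter(len(present) * bits_per_key, num_hashes)
    for key in present:
        bf.add(key)
    # A Bloom filter must never report an inserted key as missing.
    assert all(key in bf for key in present)
    hits = sum(1 for key in absent if key in bf)
    return hits / len(absent)


# Path-like keys that share long common prefixes (made-up pattern).
present = ["import/root/dir%03d/file%05d" % (i % 100, i) for i in range(20000)]
absent = ["import/root/dir%03d/missing%05d" % (i % 100, i) for i in range(20000)]
rate = false_positive_rate(present, absent)
print("false positive rate: %.5f" % rate)
```

With this sizing the theoretical false positive rate is well under 1%, so a measured rate far above that would point at the hash function or the filter sizing rather than at the keys.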
 

On Oct 27, 2010, at 4:28 PM, Jonathan Ellis wrote:

> Do you have a key "a/b" then?  What columns does it have?
> 
> On Wed, Oct 27, 2010 at 9:14 AM, Daniel Doubleday
> <daniel.doubleday@gmx.net> wrote:
>> Hm -
>> 
>> not sure if I understand the random question. We are using RP (RandomPartitioner). But I wouldn't know why that should matter.
>> I thought that the bloom filter hash function should evenly distribute no matter what keys come in.
>> 
>> Keys are '/' separated strings (aka paths :-))
>> 
>> I do bulk inserts like: (1000 rows at a time, with ~ 50 cols each)
>> 
>> [
>>        {'a/b/foo': cols},
>>        {'a/b/bar': cols},
>>        {'a/b/baz': cols}
>> ]
>> 
>> and before that I would query for 'a/b'. Recursively as in mkdir -p
>> 
>> If parent paths are missing they would be inserted with the bulk insert.
>> 
>> The value for BloomFilterFalseRatio has been in the range of 0.19 - 0.59 in the last couple of hours. Mostly around 0.3
>> 
>> We're on 0.6.6 btw
>> 
>> 
>> On Oct 27, 2010, at 3:58 PM, Jonathan Ellis wrote:
>> 
>>> This is not expected, no.  How random are your queries?  If you have a
>>> couple outlier rows causing the false positives that are being queried
>>> over and over then that could just be the luck of the draw.
>>> 
>>> On Wed, Oct 27, 2010 at 5:24 AM, Daniel Doubleday
>>> <daniel.doubleday@gmx.net> wrote:
>>>> Hi people
>>>> 
>>>> We are currently moving our second use case from MySQL to Cassandra. While importing the data (ongoing) I noticed that the BloomFilterFalseRatio seems to be pretty high compared to another CF which is in use in production right now.
>>>> 
>>>> It's a hierarchical data model and I cannot avoid doing a read before inserting multiple columns.
>>>> 
>>>> I see a false positive ratio of 0.28 while in my other CF it is 0.00025.
>>>> 
>>>> The CF had 5 live sstables when I read that ratio. At that time I had inserted ~ 200k rows with a total of 1M cols. Row keys are pretty large unfortunately (key.length() ~ 60)
>>>> 
>>>> Just wanted to check if this value is to be expected.
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> Daniel
>>> 
>>> 
>>> 
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of Riptano, the source for professional Cassandra support
>>> http://riptano.com

