cassandra-user mailing list archives

From Daniel Doubleday <daniel.double...@gmx.net>
Subject Re: High BloomFilterFalseRation
Date Wed, 27 Oct 2010 18:35:27 GMT
Ah of course - question makes total sense.

But no, that is not the case. I am not constantly asking the same 
question, because the tree is deep enough: most data nodes are at level 5 
from the root, so the parents being queried are different most of the time.
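
For reference, the skew effect raised in the quoted message can be reproduced with a toy simulation. This is only a sketch under stated assumptions: the BloomFilter class, the key names, and the sizes (8000 bits, 3 hashes, 1000 stored keys) are made up for illustration and are not Cassandra's implementation or settings. It compares the observed false positive rate for evenly spread queries against a query mix where one unlucky key is asked for over and over.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter for illustration; not Cassandra's implementation."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical keys: 1000 stored rows, 10000 distinct keys that were never stored.
stored = [f"a/b/item{i}" for i in range(1000)]
absent = [f"x/y/item{i}" for i in range(10000)]

bloom = BloomFilter(num_bits=8000, num_hashes=3)
for key in stored:
    bloom.add(key)


def observed_false_positive_rate(queries):
    """Fraction of queries (all for absent keys) that the filter lets through."""
    hits = sum(1 for key in queries if bloom.might_contain(key))
    return hits / len(queries)


# Evenly spread queries: every absent key is asked for exactly once.
uniform_rate = observed_false_positive_rate(absent)

# Skewed queries: one absent key that happens to be a false positive is "hot"
# and makes up half of all queries.
hot = next(key for key in absent if bloom.might_contain(key))
skewed_rate = observed_false_positive_rate(absent + [hot] * len(absent))

print(f"uniform queries: {uniform_rate:.3f}")
print(f"skewed queries:  {skewed_rate:.3f}")
```

The filter is identical in both runs; only the query mix changes. That is why the question about query randomness matters, and why a deep tree with varying parents should not show this effect.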

Once the parent nodes have been created, the queries stop there and don't 
propagate toward the root.
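
The lookup pattern can be sketched roughly as follows. This is a minimal illustration, not the actual import code: missing_parents, exists, and the in-memory store are hypothetical stand-ins for the real read against the CF.

```python
def missing_parents(key, exists):
    """Return the parent paths of `key` that are missing, nearest parent first.

    Walks from the immediate parent toward the root and stops at the first
    parent that already exists, in the spirit of `mkdir -p`.
    """
    missing = []
    parent = key.rpartition('/')[0]
    while parent and not exists(parent):
        missing.append(parent)
        parent = parent.rpartition('/')[0]
    return missing


# Hypothetical stand-in for the column family: a set of existing row keys.
store = {'a', 'a/b'}
reads = []


def exists(path):
    reads.append(path)  # record every read to show where the walk stops
    return path in store


result = missing_parents('a/b/c/d/foo', exists)
print(result)  # parents that would go into the next bulk insert
print(reads)   # keys that were actually read
```

In this sketch the walk reads 'a/b/c/d', 'a/b/c' and 'a/b', then stops because 'a/b' exists; 'a' is never queried.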

And I am seeing the high values all the time. The best it gets is 0.15.

Daniel

On 27.10.10 18:37, Mike Malone wrote:
> I think he was asking about queries, not data. The data may be 
> randomly distributed by way of a hash on the key, but if your queries 
> are heavily skewed (e.g., if you query for "foo" a lot more than 
> "foo/bar", and "foo" randomly happens to trigger a false positive) the 
> skew in your query pattern could cause a seemingly strange spike in 
> false positives.
>
> With a hierarchical data model it's not unlikely that this sort of 
> skew exists since you'd tend to query for items towards the root of 
> the hierarchy more frequently.
>
> Mike
>
> On Wed, Oct 27, 2010 at 2:14 PM, Daniel Doubleday 
> <daniel.doubleday@gmx.net> wrote:
>
>     Hm -
>
>     not sure if I understand the random question. We are using RP. But
>     I wouldn't know why that should matter.
>     I thought that the bloom filter hash function should evenly
>     distribute no matter what keys come in.
>
>     Keys are '/' separated strings (aka paths :-))
>
>     I do bulk inserts like: (1000 rows at a time, with ~ 50 cols each)
>
>     [
>            {'a/b/foo': cols},
>            {'a/b/bar': cols},
>            {'a/b/baz': cols}
>     ]
>
>     and before that I would query for 'a/b'. Recursively as in mkdir -p
>
>     If parent paths are missing they would be inserted with the bulk
>     insert.
>
>     The value for BloomFilterFalseRatio has been in the range of 0.19
>     - 0.59 in the last couple of hours. Mostly around 0.3
>
>     We're on 0.6.6 btw
>
>
>     On Oct 27, 2010, at 3:58 PM, Jonathan Ellis wrote:
>
>     > This is not expected, no.  How random are your queries?  If you have a
>     > couple outlier rows causing the false positives that are being queried
>     > over and over then that could just be the luck of the draw.
>     >
>     > On Wed, Oct 27, 2010 at 5:24 AM, Daniel Doubleday
>     > <daniel.doubleday@gmx.net> wrote:
>     >> Hi people
>     >>
>     >> We are currently moving our second use case from MySQL to
>     >> Cassandra. While importing the data (ongoing) I noticed that the
>     >> BloomFilterFalseRatio seems to be pretty high compared to another
>     >> CF which is in use in production right now.
>     >>
>     >> It's a hierarchical data model and I cannot avoid doing a read
>     >> before inserting multiple columns.
>     >>
>     >> I see a false positive ratio of 0.28 while in my other CF it
>     >> is 0.00025.
>     >>
>     >> The CF had 5 live sstables when I read that ratio. By that
>     >> time I had inserted ~ 200k rows with a total of 1M cols. Row keys are
>     >> pretty large unfortunately (key.length() ~ 60)
>     >>
>     >> Just wanted to check if this value is to be expected.
>     >>
>     >>
>     >>
>     >> Thanks,
>     >> Daniel
>     >
>     >
>     >
>     > --
>     > Jonathan Ellis
>     > Project Chair, Apache Cassandra
>     > co-founder of Riptano, the source for professional Cassandra support
>     > http://riptano.com
>
>

