Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <AE6328DA-C100-4509-A2D4-81A33A80B22C@gmx.net>
References: <BFC6AEDF-5688-4E18-ABB8-36F76F6A66A2@gmx.net>
	<AANLkTin0KxX1LZ-xT1HH-AxC8m1Zge0H+vnCCmf_ELEK@mail.gmail.com>
	<AE6328DA-C100-4509-A2D4-81A33A80B22C@gmx.net>
Date: Wed, 27 Oct 2010 16:37:15 +0000
Message-ID: <AANLkTinyuuyeAyUB_q3M7OViuGxUzXEc0uBJXCrxrUdC@mail.gmail.com>
Subject: Re: High BloomFilterFalseRation
From: Mike Malone <mike@simplegeo.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=20cf301cc39a91f12704939bd6d5

--20cf301cc39a91f12704939bd6d5
Content-Type: text/plain; charset=ISO-8859-1

I think he was asking about queries, not data. The data may be randomly
distributed by way of a hash on the key, but if your queries are heavily
skewed (e.g., if you query for "foo" a lot more than "foo/bar", and "foo"
randomly happens to trigger a false positive) the skew in your query pattern
could cause a seemingly strange spike in false positives.

With a hierarchical data model this sort of skew is quite likely, since
you'd tend to query for items towards the root of the hierarchy more
frequently.
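
To make the skew point concrete, here is a minimal sketch in Python. It is
not Cassandra's bloom filter or its metric: the BloomFilter class, the
SHA-256 based hashing, the sizes, the 'a/c/...' key names, and the 30% hot-key
share are all assumptions picked for illustration. It sends two workloads of
absent keys to the same filter, one uniform and one where a single key that
happens to be a false positive is queried over and over, and compares the
fraction of lookups the filter wrongly lets through.

```python
import hashlib
import random


class BloomFilter:
    """Toy Bloom filter: k bit positions per key, derived from SHA-256."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Salt the key with the hash index to get k independent positions.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def false_positive_ratio(bf, present, queries):
    """Fraction of lookups for absent keys that the filter wrongly passes."""
    absent = [q for q in queries if q not in present]
    hits = sum(1 for q in absent if bf.might_contain(q))
    return hits / len(absent)


def simulate(seed=1, hot_share=0.3, num_queries=10000):
    rng = random.Random(seed)

    # Keys that really exist (hypothetical path-style row keys).
    present = {f"a/b/item{i}" for i in range(2000)}
    bf = BloomFilter(num_bits=20000, num_hashes=3)
    for key in present:
        bf.add(key)

    # Keys that do not exist; a small share of them are false positives.
    absent = [f"a/c/item{i}" for i in range(20000)]

    # Pick one absent key the filter wrongly accepts and treat it as the
    # frequently queried parent path.
    hot = next(k for k in absent if bf.might_contain(k))

    uniform = [rng.choice(absent) for _ in range(num_queries)]
    skewed = [hot if rng.random() < hot_share else rng.choice(absent)
              for _ in range(num_queries)]

    return (false_positive_ratio(bf, present, uniform),
            false_positive_ratio(bf, present, skewed))


if __name__ == "__main__":
    uniform_ratio, skewed_ratio = simulate()
    print(f"uniform queries: {uniform_ratio:.4f}")
    print(f"skewed queries:  {skewed_ratio:.4f}")
```

In this sketch the uniform workload stays close to the filter's design
false-positive rate, while the skewed workload ends up near the hot key's
share of the queries, even though the filter and the stored data are
identical in both runs.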

Mike

On Wed, Oct 27, 2010 at 2:14 PM, Daniel Doubleday
<daniel.doubleday@gmx.net> wrote:

> Hm -
>
> not sure if I understand the random question. We are using RP
> (RandomPartitioner). But I wouldn't know why that should matter.
> I thought that the bloom filter hash function should evenly distribute no
> matter what keys come in.
>
> Keys are '/' separated strings (aka paths :-))
>
> I do bulk inserts like: (1000 rows at a time, with ~ 50 cols each)
>
> [
>        {'a/b/foo': cols},
>        {'a/b/bar': cols},
>        {'a/b/baz': cols}
> ]
>
> and before that I would query for 'a/b'. Recursively as in mkdir -p
>
> If parent paths are missing they would be inserted with the bulk insert.
>
> The value for BloomFilterFalseRatio has been in the range of 0.19 - 0.59 in
> the last couple of hours. Mostly around 0.3
>
> We're on 0.6.6 btw
>
>
> On Oct 27, 2010, at 3:58 PM, Jonathan Ellis wrote:
>
> > This is not expected, no.  How random are your queries?  If you have a
> > couple outlier rows causing the false positives that are being queried
> > over and over then that could just be the luck of the draw.
> >
> > On Wed, Oct 27, 2010 at 5:24 AM, Daniel Doubleday
> > <daniel.doubleday@gmx.net> wrote:
> >> Hi people
> >>
> >> We are currently moving our second use case from mysql to cassandra.
> >> While importing the data (ongoing) I noticed that the
> >> BloomFilterFalseRatio seems to be pretty high compared to another CF
> >> which is in use in production right now.
> >>
> >> It's a hierarchical data model and I cannot avoid doing a read before
> >> inserting multiple columns.
> >>
> >> I see a false positive ratio of 0.28 while in my other CF it is
> >> 0.00025.
> >>
> >> The CF had 5 live sstables when I read that ratio. At that time I had
> >> inserted ~ 200k rows with a total of 1M cols. Row keys are pretty large
> >> unfortunately (key.length() ~ 60)
> >>
> >> Just wanted to check if this value is to be expected.
> >>
> >>
> >>
> >> Thanks,
> >> Daniel
> >
> >
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of Riptano, the source for professional Cassandra support
> > http://riptano.com
>
>

--20cf301cc39a91f12704939bd6d5--