Subject: Re: Many keyspaces pattern
From: Jack Krupansky
To: user@cassandra.apache.org
Date: Tue, 24 Nov 2015 15:06:01 -0500

And DateTieredCompactionStrategy can be used to efficiently remove whole
sstables when the TTL expires, but this implies knowing what TTL to set in
advance.

I don't know if there are any tools to bulk delete data older than a specific
age when DateTieredCompactionStrategy is used, but it might be a nice feature.

-- Jack Krupansky

On Tue, Nov 24, 2015 at 12:53 PM, Saladi Naidu wrote:

> I can think of the following features to solve this:
>
> 1. If you know after how long the data should be removed, use the TTL
> feature.
> 2. Model the data as a time series and use an inverted index to query the
> data by time period.
>
> Naidu Saladi
>
> On Tuesday, November 24, 2015 6:49 AM, Jack Krupansky
> <jack.krupansky@gmail.com> wrote:
>
> How often is "sometimes" - closer to 20% of the batches, or 2%?
>
> How are you querying batches, both current and older ones?
>
> As always, your queries should drive your data models.
>
> If deleting a batch is very infrequent, it may be best not to do it and
> simply have logic in the app to ignore deleted batches - if your queries
> would reference them at all.
>
> What reasons would you have to delete a batch? Depending on the nature of
> the reason, there may be an alternative.
>
> Make sure your cluster is adequately provisioned so that these expensive
> operations can occur in parallel, to reduce their time and resources per
> node.
>
> Do all batches eventually get aged out and deleted, or are you expecting
> that most batches will live for many years to come? Have you planned for
> how you will grow the cluster over time?
>
> Maybe bite the bullet and use a background process to delete a batch if
> deletion is competing too heavily with query access - if the batches really
> need to be deleted at all.
>
> The number of keyspaces - and/or tables - should be limited to the "low
> hundreds", and even then you are limited by the RAM and CPU of each node.
> If a keyspace has 14 tables, then 250/14, roughly 18, would be a
> recommended upper limit for the number of keyspaces. Even if your total
> number of tables was under 300, or even 200, you would need to do a
> proof-of-concept implementation to verify that your specific data works
> well on your specific hardware.
>
> -- Jack Krupansky
>
> On Tue, Nov 24, 2015 at 5:05 AM, Jonathan Ballet <jballet@edgelab.ch>
> wrote:
>
> Hi,
>
> we are running an application which produces a batch with several hundreds
> of gigabytes of data every night. Once a batch has been computed, it is
> never modified (no updates, no deletes); we just keep producing new
> batches every day.
>
> Now, we are *sometimes* interested in removing a complete specific batch
> altogether. At the moment, we are accumulating all these data into only
> one keyspace, which has a batch ID column in all our tables that is also
> part of the primary key. A sample table looks similar to this:
>
>   CREATE TABLE computation_results (
>       batch_id int,
>       id1 int,
>       id2 int,
>       value double,
>       PRIMARY KEY ((batch_id, id1), id2)
>   ) WITH CLUSTERING ORDER BY (id2 ASC);
>
> But we found out it is very difficult to remove a specific batch, as we
> need to know all the IDs to delete the entries, and it's both time- and
> resource-consuming (i.e. it takes a long time and I'm not sure it's going
> to scale at all).
>
> So, we are currently looking into having each of our batches in a keyspace
> of its own, so that removing a batch is merely equivalent to deleting a
> keyspace. Potentially, this means we will end up having several hundreds
> of keyspaces in one cluster, although most of the time only the very last
> one will be used (we might still want to access the older ones, but that
> would be a very seldom use-case). At the moment, the keyspace has about 14
> tables and is probably not going to evolve much.
>
> Are there any counter-indications to using lots of keyspaces (300+) in one
> Cassandra cluster?
> Are there any good practices that we should follow?
> After reading the "Anti-patterns in Cassandra > Too many keyspaces or
> tables" documentation, does it mean we should plan ahead to split our
> keyspace among several clusters?
>
> Finally, would there be any other way to achieve what we want to do?
>
> Thanks for your help!
>
> Jonathan
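To make the two suggestions in this thread concrete, here is a rough CQL
sketch (the keyspace name and retention period are made-up illustrations,
not a tested setup). One keyspace per batch makes removal a single DROP,
while a default TTL plus DateTieredCompactionStrategy lets expired data be
dropped as whole sstables - but only if the retention period is known up
front:

  -- Keyspace-per-batch: dropping a batch is one statement.
  -- (Hypothetical name: batch_20151124.)
  CREATE KEYSPACE batch_20151124
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

  -- Alternative: keep one keyspace, but give the table a default TTL and
  -- DTCS so expired rows can age out as whole sstables.
  CREATE TABLE computation_results (
      batch_id int,
      id1 int,
      id2 int,
      value double,
      PRIMARY KEY ((batch_id, id1), id2)
  ) WITH CLUSTERING ORDER BY (id2 ASC)
    AND default_time_to_live = 2592000  -- 30 days; assumed retention period
    AND compaction = {'class': 'DateTieredCompactionStrategy'};

  -- Removing a whole batch under the keyspace-per-batch approach:
  DROP KEYSPACE batch_20151124;

Note that the TTL variant only works if you know in advance how long batches
must be kept, as pointed out at the top of the thread.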