From: Jack Krupansky
To: user@cassandra.apache.org
Date: Thu, 14 Apr 2016 10:36:42 -0400
Subject: Re: Cassandra 2.1.12 Node size

The four criteria I would suggest for evaluating node size:

1. Query latency.
2. Query throughput/load.
3. Repair time - worst case, a full repair, which you can least afford if it happens at the worst time.
4. Expected growth over the next six to 18 months - you don't want to be scrambling with latency, throughput, and repair problems when you bump into a wall on capacity. 20% to 30% is a fair number.

Alas, it is very difficult to determine how much spare capacity you have, other than with an artificial, synthetic load test: try 30% more clients and queries with 30% more (synthetic) data and see what happens to query latency, total throughput, and repair time. Run such a test periodically (monthly) to get a heads-up when load is getting closer to a wall.
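If it helps, the stress tool that ships with Cassandra can drive that kind of test. A minimal sketch - host addresses, row counts, and thread counts below are placeholders you would scale roughly 30% past production levels:

    # Load a synthetic dataset ~30% larger than today's per-node data
    # (n, threads, and node addresses are placeholders):
    cassandra-stress write n=100000000 -rate threads=260 -node 10.0.0.1,10.0.0.2

    # Replay a read-heavy mix and watch latency and op rate in the output:
    cassandra-stress mixed ratio\(write=1,read=3\) n=100000000 -rate threads=260 -node 10.0.0.1,10.0.0.2

Run it against a test ring if at all possible; against production it will obviously add real load on top of real traffic.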

Incremental repair is great to streamline and optimize your day-to-day operations, but focus attention on replacement of down nodes during times of stress.

-- Jack Krupansky

On Thu, Apr 14, 2016 at 10:14 AM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:

> Would adding nodes be the right way to start if I want to get the data
> per node down?

Yes, if everything else is fine, the last and always available option to reduce the disk size per node is to add new nodes. Sometimes it is the first option considered, as it is relatively quick and quite straightforward.

Again, 50% of free disk space is not a hard limit. To give you a rough idea, if the biggest sstable is 100 GB and you still have 400 GB free, you will probably be good to go - except if four 100 GB compactions trigger at the same time, filling up the disk.
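A rough way to check where a node stands, assuming the default data directory (adjust the path to your installation):

    # Free space on the data volume:
    df -h /var/lib/cassandra/data

    # Biggest sstables first, to estimate what a worst-case compaction may need:
    find /var/lib/cassandra/data -name '*-Data.db' -printf '%s\t%p\n' \
        | sort -rn | head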

Now is a good time to think of a plan to handle the growth, but don't worry if data reaches 60%; it will probably not be a big deal.

You can make sure that:

- There are no snapshots, heap dumps, or data unrelated to C* taking up space (a quick way to check this and the next point is sketched after this list).
- The tombstone ratios of your biggest sstables are not too high (are tombstones being evicted correctly?).
- You are using compression (if you want to).
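For the first two points, something along these lines should do; paths and names below are examples only:

    # List snapshots and the space they hold:
    nodetool listsnapshots

    # Without arguments this removes ALL snapshots on the node:
    nodetool clearsnapshot

    # Estimated droppable tombstones for a given sstable (2.1 'ka' format):
    sstablemetadata /var/lib/cassandra/data/my_ks/my_table-*/my_ks-my_table-ka-1234-Data.db | grep -i tombstone

Compression settings are visible with DESCRIBE TABLE in cqlsh.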

Consider:

- Adding TTLs to data you don't want to keep forever, shortening TTLs as much as allowed (a sketch follows this list).
- Migrating to C* 3.0+ to take advantage of the new storage engine.
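A sketch of the first point - the keyspace and table names are made up, and a default TTL only applies to data written after the change:

    # 30-day default TTL on newly written rows (hypothetical table):
    cqlsh -e "ALTER TABLE my_ks.events WITH default_time_to_live = 2592000;"

Existing rows keep whatever TTL they were written with, so the effect on disk space is gradual.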

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-04-14 15:41 GMT+02:00 Aiman Parvaiz <aiman@flipagram.com>:

Thanks for the response Alain. I am using STCS and would like to take some action, as we will be hitting 50% disk space pretty soon. Would adding nodes be the right way to start if I want to get the data per node down? Otherwise, can you or someone on the list please suggest the right way to go about it?

Thanks

Sent from my iPhone

On Apr 14, 2016, at 5:17 PM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:

Hi,

> I seek advice on data size per node. Each of my nodes has close to 1 TB
> of data. I am not seeing any issues as of now, but wanted to run it by
> you guys whether this data size is pushing the limits in any manner and
> whether I should be working on reducing the data size per node.

There is no real limit to the data size other than 50% of the machine's disk space using STCS and 80% if you are using LCS. Those are 'soft' limits, as it mainly depends on your biggest sstable sizes and the number of concurrent compactions, but to stay away from trouble it is better to keep things under control, below the limits mentioned above.

> I will be migrating to incremental repairs shortly, and a full repair as
> of now takes 20 hr/node. I am not seeing any issues with the nodes for
> now.

As you noticed, you need to keep in mind that the larger the dataset is, the longer operations will take: repairs, but also bootstrapping a node, replacing a node, removing a node - any operation that requires streaming or reading data. Repair time can indeed be mitigated by using incremental repairs.

> I am running a 9 node C* 2.1.12 cluster.

It should be quite safe to give incremental repair a try, as many bugs have been fixed in this version:

FIX 2.1.12 - A lot of sstables created when using range repairs, due to anticompaction - incremental only
https://issues.apache.org/jira/browse/CASSANDRA-10422

FIX 2.1.12 - repair hangs when a replica is down - incremental only
https://issues.apache.org/jira/browse/CASSANDRA-10288
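The invocation itself is simple on 2.1 - a sketch with a placeholder keyspace (check 'nodetool help repair' on your exact version; note there is also a documented migration procedure for clusters with existing data):

    # Parallel incremental repair of one keyspace on this node:
    nodetool repair -par -inc my_keyspace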

If you are using DTCS, be aware of https://issues.apache.org/jira/browse/CASSANDRA-11113

If using LCS, watch sstable counts and pending compactions closely.
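For example (the table name is a placeholder):

    # Pending compactions and anything currently compacting:
    nodetool compactionstats

    # Per-table sstable counts; with LCS also check 'SSTables in each level':
    nodetool cfstats my_ks.my_table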

As a general comment, I would say that Cassandra has evolved to be able to handle huge datasets (off-heap memory structures, larger heaps using G1GC, JBOD, vnodes, ...). Today Cassandra works just fine with big datasets. I have seen clusters with 4+ TB nodes and others using a few GB per node. It all depends on your requirements and your machine specs. If fast operations are absolutely necessary, keep it small. If you want to use the entire disk space (50/80% of total disk space max), go ahead, as long as other resources are fine (CPU, memory, disk throughput, ...).

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



