From: aaron morton <aaron@thelastpickle.com>
To: user@cassandra.apache.org
Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.
Date: Fri, 7 Dec 2012 16:43:04 +1300

> Meaning terabyte size databases.

Lots of people have TB sized systems. Just add more nodes.

300 to 400 GB is just a rough guideline. The bigger picture is considering how routine and non-routine maintenance tasks are going to be carried out.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 7/12/2012, at 4:38 AM, Edward Capriolo wrote:

> http://wiki.apache.org/cassandra/LargeDataSetConsiderations
>
> On Thu, Dec 6, 2012 at 9:53 AM, Poziombka, Wade L wrote:
>
> "Having so much data on each node is a potential bad day."
>
> Is this discussed somewhere in the Cassandra documentation (limits, practices etc)? We are also trying to load up quite a lot of data and have hit memory issues (bloom filter etc.) in 1.0.10. I would like to read up on big data usage of Cassandra. Meaning terabyte size databases.
>
> I do get your point about the amount of time required to recover a downed node. But this 300-400GB business is interesting to me.
>
> Thanks in advance.
>
> Wade
>
> From: aaron morton [mailto:aaron@thelastpickle.com]
> Sent: Wednesday, December 05, 2012 9:23 PM
> To: user@cassandra.apache.org
> Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.
>
> Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!
>
> I would recommend having up to 300GB to 400GB per node on a regular HDD with 1Gb networking.
>
> But on the 3rd node, we suspect major compaction didn't actually finish its job…
>
> The file list looks odd. Check the timestamps on the files. You should not have files older than when compaction started.
>
> 8GB heap
>
> The default is 4GB max nowadays.
>
> 1) Do you expect problems with the 3rd node during 2 more weeks of operations, in the conditions seen below?
>
> I cannot answer that.
>
> 2) Should we restart with leveled compaction next year?
>
> I would run some tests to see how it works for your workload.
>
> 4) Should we consider increasing the cluster capacity?
>
> IMHO yes.
>
> You may also want to do some experiments with turning compression on if it is not already enabled.
>
> Having so much data on each node is a potential bad day. If instead you had to move or repair one of those nodes, how long would it take for Cassandra to stream all the data over? (Or to rsync the data over.) How long does it take to run nodetool repair on the node?
>
> With RF 3, if you lose a node you have lost your redundancy. It's important to have a plan for how to get it back and how long it may take.
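A back-of-envelope calculation for the streaming question above, assuming the 1 Gb networking mentioned in the thread and a sustained rate of ~80 MB/s (the sustained rate is an assumption, not a measured figure):

```shell
# Rough best-case time to restream one ~1.1 TB node over 1 Gb/s networking.
# 1 Gb/s is ~125 MB/s theoretical; assume ~80 MB/s sustained in practice.
BYTES_MB=$((1100 * 1024))   # ~1.1 TB expressed in MB
RATE=80                     # assumed sustained MB/s
SECS=$((BYTES_MB / RATE))
printf 'best case ~%d hours to restream one node\n' $((SECS / 3600))
```

Real repairs and bootstraps also pay compaction and validation costs on top of raw transfer, so the actual number is higher; the point is that even the floor is measured in hours.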
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 6/12/2012, at 3:40 AM, Alexandru Sicoe wrote:
>
> Hi guys,
> Sorry for the late follow-up but I waited to run major compactions on all 3 nodes at a time before replying with my findings.
>
> Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!
>
> But on the 3rd node, we suspect major compaction didn't actually finish its job. First of all, nodetool compact returned much earlier than the rest - after one day and 15 hrs. Secondly, from the 1.4TB initially on the node only about 36GB were freed up (almost the same size as before). Saw nothing in the server log (debug not enabled). Below I pasted some more details about file sizes before and after compaction on this third node, and disk occupancy.
>
> The situation is maybe not so dramatic for us because in less than 2 weeks we will have a downtime till after the new year. During this we can completely delete all the data in the cluster and start fresh with TTLs for 1 month (as suggested by Aaron, and an 8GB heap as suggested by Alain - thanks).
>
> Questions:
>
> 1) Do you expect problems with the 3rd node during 2 more weeks of operations, in the conditions seen below?
> [Note: we expect the minor compactions to continue building up files but never really getting to compacting the large file, and thus not needing much temporary extra disk space.]
>
> 2) Should we restart with leveled compaction next year?
> [Note: Aaron was right, we have 1-week rows which get deleted after 1 month, which means older rows end up in big files => to free up space with SizeTiered we will have no choice but to run major compactions, which we don't know will work given that we get ~1TB / node / month. You can see we are at the limit!]
>
> 3) In case we keep SizeTiered:
>
>     - How can we improve the performance of our major compactions? (We left all config parameters at their defaults.) Would increasing compaction throughput interfere with writes and reads? What about multi-threaded compactions?
>
>     - Do we still need to run regular repair operations as well? Do these also do a major compaction, or are they completely separate operations?
>
> [Note: we have 3 nodes with RF=2, inserting at consistency level ONE and reading at consistency level ALL. We read primarily for exporting reasons - we export 1 week worth of data at a time.]
>
> 4) Should we consider increasing the cluster capacity?
> [We generate ~5 million new rows every week, which shouldn't come close to the hundreds of millions of rows per node mentioned by Aaron as the volumes that would create problems with bloom filters and indexes.]
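On the compaction-throughput question, the cap can be changed at runtime with nodetool; a sketch of the knobs involved (the 32 MB/s value is only an illustration, not a recommendation, and this needs a live node to run against):

```shell
# Raise the compaction throughput cap on a live node. The Cassandra 1.1
# default is 16 MB/s; 0 removes the throttle entirely. A higher cap finishes
# compactions sooner but competes with reads and writes for disk I/O.
nodetool -h $HOSTNAME setcompactionthroughput 32

# Check which compactions are running and how far along they are:
nodetool -h $HOSTNAME compactionstats
```

The setting is not persisted across restarts; to make it permanent, set compaction_throughput_mb_per_sec in cassandra.yaml as well.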
>
> Cheers,
> Alex
> ------------------
>
> The situation in the data folder
>
> before calling nodetool compact:
>
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 376G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-46431-Data.db
> 305G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-68959-Data.db
> 39G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-7352-Data.db
> 78G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-74076-Data.db
> 81G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-79663-Data.db
> 205M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80370-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80968-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-82330-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-83710-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84015-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84356-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84696-Data.db
> 333M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84707-Data.db
> 92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84712-Data.db
> 92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84717-Data.db
> 99M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84722-Data.db
> 2.5G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-tmp-he-84723-Data.db
> 1.4T    total
>
> after nodetool compact returned:
>
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 910G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84723-Data.db
> 19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-86229-Data.db
> 19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87639-Data.db
> 5.0G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87923-Data.db
> 4.8G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88261-Data.db
> 338M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88271-Data.db
> 339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88292-Data.db
> 339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88312-Data.db
> 98M
>
> Looking at the disk occupancy for the logical partition the data folder is in:
>
> df /data_bst
> Filesystem           1K-blocks       Used  Available Use% Mounted on
> /dev/sdb1            2927242720 1482502260 1444740460  51% /data_bst
>
> and the situation in the cluster
>
> nodetool -h $HOSTNAME ring (before major compaction)
> Address         DC           Rack   Status  State   Load     Effective-Ownership  Token
>                                                                                   113427455640312821154458202477256070484
> 10.146.44.17    datacenter1  rack1  Up      Normal  1.37 TB  66.67%               0
> 10.146.44.18    datacenter1  rack1  Up      Normal  1.04 TB  66.67%               56713727820156410577229101238628035242
> 10.146.44.32    datacenter1  rack1  Up      Normal  1.14 TB  66.67%               113427455640312821154458202477256070484
>
> nodetool -h $HOSTNAME ring (after major compaction) (Note: we were inserting data in the meantime)
> Address         DC           Rack   Status  State   Load     Effective-Ownership  Token
>                                                                                   113427455640312821154458202477256070484
> 10.146.44.17    datacenter1  rack1  Up      Normal  1.38 TB  66.67%               0
> 10.146.44.18    datacenter1  rack1  Up      Normal  1.08 TB  66.67%               56713727820156410577229101238628035242
> 10.146.44.32    datacenter1  rack1  Up      Normal  1.19 TB  66.67%               113427455640312821154458202477256070484
>
> On Fri, Nov 23, 2012 at 2:16 AM, aaron morton wrote:
>
> > From what I know having too much data on one node is bad, not really sure why, but I think that performance will go down due to the size of indexes and bloom filters (I may be wrong on the reasons but I'm quite sure you can't store too much data per node).
>
> If you have many hundreds of millions of rows on a node, the memory needed for bloom filters and index sampling can be significant.
These can both be tuned.
>
> If you have 1.1T per node, the time to do a compaction, repair or upgrade may be very significant. Also, the time taken to copy this data should you need to remove or replace a node may be prohibitive.
>
> > 2. Switch to Leveled compaction strategy.
>
> I would avoid making a change like that on an unstable / at-risk system.
>
> > - Our usage pattern is write once, read once (export) and delete once!
>
> The column TTL may be of use to you; it removes the need to do a delete.
>
> > - We were thinking of relying on the automatic minor compactions to free up space for us but as..
>
> There are some usage patterns which make life harder for STS. For example, if you have very long lived rows that are written to and deleted a lot, row fragments that have been around for a while will end up in bigger files, and these files get compacted less often.
>
> In this situation, if you are running low on disk space and you think there is a lot of deleted data in there, I would run a major compaction. A word of warning though: if you do this you will need to continue to do it regularly. Major compaction creates a single big file that will not get compacted often. There are ways to resolve this, and moving to LDB may help in the future.
>
> If you are stuck and worried about disk space it's what I would do. Once you are stable again then look at LDB: http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 23/11/2012, at 9:18 AM, Alain RODRIGUEZ wrote:
>
> > Hi Alexandru,
> >
> > "We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per node for the data dir and separate disk for the commitlog, 12 cores, 24 GB RAM"
> >
> > I think you should tune your architecture in a very different way.
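Aaron's TTL suggestion can be sketched in CQL. The column family and column names below are hypothetical, and the TTL value simply encodes the 1-month retention discussed in the thread:

```shell
# Express 1 month in seconds for USING TTL: 30 days * 24 h * 3600 s.
TTL=$((30 * 24 * 3600))   # 2592000

# Emit the statement; USING TTL makes the columns expire automatically,
# so the write-once/read-once/delete-once cycle loses its explicit delete step.
echo "INSERT INTO events (key, payload) VALUES ('row-2012-48', 'blob') USING TTL ${TTL};"

# Against a live node this would be piped into cqlsh, e.g.:
#   echo "INSERT ... USING TTL ${TTL};" | cqlsh $HOSTNAME
```

Expired columns still occupy disk until compaction purges them past gc_grace, so TTL removes the delete workload but not the compaction dependency.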
> > From what I know having too much data on one node is bad; not really sure why, but I think that performance will go down due to the size of indexes and bloom filters (I may be wrong on the reasons but I'm quite sure you can't store too much data per node).
> >
> > Anyway, I think 6 nodes with half of these resources (6 cores / 12GB) would be better if you have the choice.
> >
> > "(12GB to Cassandra heap)."
> >
> > The max heap recommended is 8GB, because if you use more than these 8GB the GC jobs will start decreasing your performance.
> >
> > "We now have 1.1 TB worth of data per node (RF = 2)."
> >
> > You should use RF=3 unless consistency or SPOF doesn't matter to you.
> >
> > With RF=2 you are obliged to write at CL.ONE to remove the single point of failure.
> >
> > "1. Start issuing regular major compactions (nodetool compact).
> >      - This is not recommended:
> >             - Stops minor compactions.
> >             - Major performance hit on node (very bad for us because need to be taking data all the time)."
> >
> > Actually, major compaction *does not* stop minor compactions. What happens is that due to the size of the sstable that remains after your major compaction, it will never be compacted with the upcoming new sstables, and because of that, your read performance will go down until you run another major compaction.
> >
> > "2. Switch to Leveled compaction strategy.
> >       - It is mentioned to help with deletes and disk space usage. Can someone confirm?"
> >
> > From what I know, Leveled compaction will not free disk space, but it will allow you to use a greater percentage of your total disk space (50% max for size-tiered compaction vs about 80% for leveled compaction).
> >
> > "Our usage pattern is write once, read once (export) and delete once!"
> >
> > In this case, I think that leveled compaction fits your needs.
> >
> > "Can anyone suggest which (if any) is better? Are there better solutions?"
> >
> > Are your sstables compressed? You have 2 types of built-in compression, and you may use them depending on the model of each of your CFs.
> >
> > see: http://www.datastax.com/docs/1.1/operations/tuning#configure-compression
> >
> > Alain
> >
> > 2012/11/22 Alexandru Sicoe <adsicoe@gmail.com>
> > We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per node for the data dir and separate disk for the commitlog, 12 cores, 24 GB RAM (12GB to Cassandra heap).
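For reference, switching a column family to leveled compaction and enabling compression in 1.1 can both be done from cassandra-cli. The ATLAS keyspace and Data CF names are taken from the du listing earlier in the thread; the option values are illustrative assumptions, not recommendations, and this needs a live node:

```shell
# Run against a live node; the new settings apply to sstables written
# after the change (existing sstables are rewritten as they are compacted).
cassandra-cli -h $HOSTNAME <<'EOF'
use ATLAS;
update column family Data with
  compaction_strategy = 'LeveledCompactionStrategy' and
  compaction_strategy_options = {sstable_size_in_mb: 160} and
  compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};
EOF
```

As Aaron notes above, this kind of change is better made on a stable cluster; switching strategies triggers a substantial recompaction of existing data.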