From: "Hiller, Dean" <Dean.Hiller@nrel.gov>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wed, 18 Sep 2013 13:38:07 -0600
Subject: Re: What is the ideal value for sstable_size_in_mb when using
 LeveledCompactionStrategy ?

Sorry, bad, bad typo... 300G is what I meant (not 300T).

The usual Cassandra advice is to stay under 1T per node, or you run into big trouble, and most people stay under 500G per node.

Later,
Dean

From: Jayadev Jayaraman <jdisalive@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, September 18, 2013 1:30 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: What is the ideal value for sstable_size_in_mb when using LeveledCompactionStrategy ?

Thanks for the quick reply. We've already upped the ulimit as high as our Linux distro allows us to (around 1.8 million).

I have a follow-up question. I see that the size of individual nodes in your use case is quite massive. Does the safe value vary widely with the underlying hardware, or would you say from experience that something around 50M (with raised file-descriptor limits) is safe for medium to large datasets on most hardware, from medium-sized (1 - 5 TB per node) to high-end (hundreds of TB)?


On Wed, Sep 18, 2013 at 3:15 PM, Hiller, Dean <Dean.Hiller@nrel.gov> wrote:
 1.  Always raise your file descriptor limits on Linux when running Cassandra. Even in 0.7 that was the recommendation, so that Cassandra could open tons of files.
 2.  We use 50M for our LCS with no performance issues.  We previously had it at 10M with no issues, but with a huge number of files of course, given our 300T per node.
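
To put rough numbers on the two points above, here is a minimal Python sketch. It assumes the 300G per node figure given in the correction later in this thread, and it assumes all data sits in full-size SSTables, so it is only a ballpark. The function names are made up for illustration. Note that each SSTable consists of several component files on disk, so the real file count is a multiple of this, and that the limit printed is for the process running the script, not for the Cassandra JVM.

```python
import math
import resource  # Unix-only standard-library module


def estimate_sstable_count(data_per_node_gb, sstable_size_mb):
    """Rough estimate: assumes all data sits in full-size SSTables."""
    return math.ceil(data_per_node_gb * 1024 / sstable_size_mb)


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    # Compare the two sstable_size_in_mb values mentioned above.
    for size_mb in (10, 50):
        count = estimate_sstable_count(300, size_mb)
        print(f"{size_mb} MB -> roughly {count} SSTables per node")
    print("open-file limits (soft, hard):", open_file_limits())
```

Under these assumptions, going from 10M to 50M cuts the estimated SSTable count per node by a factor of five.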

Dean

From: Jayadev Jayaraman <jdisalive@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, September 18, 2013 1:02 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: What is the ideal value for sstable_size_in_mb when using LeveledCompactionStrategy ?

We have set up a 24-node (m1.xlarge nodes, 1.7 TB per node) Cassandra cluster on Amazon EC2:

version=1.2.9
replication factor = 2
snitch=EC2Snitch
placement_strategy=NetworkTopologyStrategy (with 12 nodes each in 2 availability zones)

Background on our use-case :

We plan on using Hadoop with sstableloader to load 10 GB+ of analytics data per day (100 million+ row keys, 5 or so columns per day on average). We have chosen LeveledCompactionStrategy in the hope that it constrains the number of SSTables that are read in order to retrieve a slice predicate for a row. We don't want the Cassandra JVM to hold too many file-sockets (> 1000) open to SSTables, as this has caused us network / unreachability issues before. We faced this on Cassandra 0.8.9 with SizeTieredCompactionStrategy; to mitigate it, we ran minor compaction daily and major compaction semi-regularly to keep as few SSTable files as possible on disk.


If we use LeveledCompactionStrategy with a small value for sstable_size_in_mb (default = 5 MB), wouldn't that result in a very large number of SSTable files on disk? How does that affect the number of open file-sockets? (Reading the docs, I get the impression that the number of SSTable seeks per query is reduced by a large margin.) But if we use a larger value for sstable_size_in_mb, say around 200 MB, there will be 800 MB of small uncompacted SSTables on disk per column family, to which there will inevitably be file-sockets open.
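
The trade-off in the paragraph above can be sketched with some quick arithmetic in Python. This is only an illustration under stated assumptions: the daily load is spread evenly across the 24 nodes, compression and compaction are ignored, and the 4x multiplier for uncompacted data is inferred from the 200 MB -> 800 MB figures above rather than taken from the Cassandra source. All function names are hypothetical.

```python
import math


def per_node_daily_mb(cluster_daily_gb, replication_factor, node_count):
    """Data landing on one node per day, assuming an even spread."""
    return cluster_daily_gb * 1024 * replication_factor / node_count


def new_sstables_per_day(daily_mb, sstable_size_mb):
    """Full-size SSTables needed to hold one day's data on one node."""
    return math.ceil(daily_mb / sstable_size_mb)


def uncompacted_backlog_mb(sstable_size_mb, multiplier=4):
    """Uncompacted data per column family; multiplier=4 is an assumption
    inferred from the 200 MB -> 800 MB example in the message."""
    return sstable_size_mb * multiplier


if __name__ == "__main__":
    daily = per_node_daily_mb(10, 2, 24)  # 10 GB/day, RF 2, 24 nodes
    for size_mb in (5, 200):
        print(
            f"sstable_size_in_mb={size_mb}: "
            f"{new_sstables_per_day(daily, size_mb)} new SSTables/day/node, "
            f"{uncompacted_backlog_mb(size_mb)} MB uncompacted backlog"
        )
```

With these inputs each node receives a little over 850 MB per day, so the smaller size produces many more files per day while the larger one keeps more uncompacted data around.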

All in all, can someone help us figure out what value we should set sstable_size_in_mb to? I figure it's not a very good idea to set it to a larger value, but I don't know how things perform if we set it to a small value. Do we have to run major compaction regularly in this case too?
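
For reference, sstable_size_in_mb is set per table as a compaction sub-option in CQL 3. The Python sketch below only builds the statement text so that different values can be tried; the keyspace and table names are hypothetical placeholders, the helper is made up for illustration, and the value 50 is just the figure mentioned elsewhere in this thread, not a recommendation.

```python
def lcs_alter_statement(keyspace, table, sstable_size_mb):
    """Build a CQL 3 ALTER TABLE statement that selects
    LeveledCompactionStrategy with the given sstable_size_in_mb."""
    if sstable_size_mb <= 0:
        raise ValueError("sstable_size_mb must be positive")
    return (
        f"ALTER TABLE {keyspace}.{table} WITH compaction = "
        f"{{'class': 'LeveledCompactionStrategy', "
        f"'sstable_size_in_mb': {int(sstable_size_mb)}}};"
    )


if __name__ == "__main__":
    # "analytics" and "events" are placeholder names, not from this thread.
    print(lcs_alter_statement("analytics", "events", 50))
```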

Thanks
Jayadev