From: Jeremiah Jordan <JEREMIAH.JORDAN@morningstar.com>
To: user@cassandra.apache.org
Subject: Re: data size difference between supercolumn and regular column
Date: Mon, 2 Apr 2012 03:25:25 +0000

Is that 80% with compression? If not, the first thing to do is turn on compression. Cassandra doesn't behave well when it runs out of disk space. You really want to try to stay around 50%; 60-70% works, but only if it is spread across multiple column families, and even then you can run into issues when doing repairs.
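
For what it's worth, turning compression on is a one-liner from cassandra-cli. This is only a sketch against the 1.0-era options (the keyspace and column family names below are placeholders; check the option names against your version):

    use YourKeyspace;
    update column family YourCF
      with compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};

Only SSTables written after that are compressed; existing ones pick it up as they get compacted (or, if I remember right, you can rebuild them with nodetool scrub).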

-Jeremiah


On Apr 1, 2012, at 9:44 PM, Yiming Sun wrote:

Thanks Aaron. Well I guess it is possible the data files from supercolumns could've been reduced in size after compaction.

This brings up yet another question. Say I am on a shoestring budget and can only put together a cluster with very limited storage space. The first iteration of pushing data into cassandra would drive the disk usage up into the 80% range. As time goes by, there will be updates to the data, and many columns will be overwritten. If I just push the updates in, the disks will run out of space on all of the cluster nodes. What would be the best way to handle such a situation if I cannot buy larger disks? Do I need to delete the rows/columns that are going to be updated, do a compaction, and then insert the updates? Or is there a better way? Thanks

-- Y.

On Sat, Mar 31, 2012 at 3:28 AM, aaron morton <aaron@thelastpickle.com> wrote:
does cassandra 1.0 perform some default compression?
No. 

The on-disk size depends to some degree on the workload.

If there are a lot of overwrites or deletes you may have rows/columns that need to be compacted. You may have some big old SSTables that have not been compacted for a while.
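
(As an illustration: you can force a major compaction per column family with the nodetool that ships with 0.8/1.0; the host, keyspace, and column family names below are placeholders:

    nodetool -h localhost compact YourKeyspace YourCF

The usual caveat applies: a major compaction merges everything into one large SSTable, which then takes a long time to participate in minor compactions again.)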

There is some overhead involved in the super columns: the super column name, the length of the name, and the number of columns.
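
(Rough back-of-the-envelope, assuming the 0.8-era on-disk layout where each super column carries a 2-byte name length, the name itself, deletion metadata, and a 4-byte sub-column count: a super column with an 8-byte name would cost roughly 2 + 8 + 12 + 4 = 26 extra bytes per row, before its sub-columns. Worth verifying against the 0.8 source before relying on the exact numbers.)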

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/03/2012, at 9:47 AM, Yiming Sun wrote:

Actually, after I read an article on cassandra 1.0 compression just now (http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression), I am more puzzled. In our schema, we didn't specify any compression options -- does cassandra 1.0 perform some default compression? Or is the data reduction purely because of the schema change? Thanks.
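
(One quick way to check what a column family is actually using, assuming the cassandra-cli bundled with 1.0:

    show keyspaces;

prints each column family's settings, including its compression options. In 1.0, a column family created without compression options stays uncompressed.)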

-- Y.

On Wed, Mar 28, 2012 at 4:40 PM, Yiming Sun <yiming.sun@gmail.com> wrote:
Hi,

We are trying to estimate the amount of storage we need for a production cassandra cluster. While I was doing the calculation, I noticed a very dramatic difference in terms of storage space used by cassandra data files.

Our previous setup was a single-node cassandra 0.8.x with no replication; the data was stored using supercolumns, and the data files totaled about 534GB on disk.

A few weeks ago, I put together a cluster consisting of 3 nodes running cassandra 1.0 with a replication factor of 2, and the data is flattened out and stored using regular columns. The aggregated data file size is only 488GB (it would be 244GB with no replication).
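
(As an aside, two quick ways to measure this per node, assuming default paths; the host and keyspace name below are placeholders: the Load column of

    nodetool -h localhost ring

reports each node's live data size, and

    du -sh /var/lib/cassandra/data/YourKeyspace

sums the data files on disk, including anything not yet cleaned up.)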

This is a very dramatic reduction in terms of storage needs, and is certainly good news in terms of how much storage we need to provision. However, because of the dramatic reduction, I also would like to make sure it is absolutely correct before submitting it -- and also get a sense of why there was such a difference. I know cassandra 1.0 does data compression, but does the schema change from supercolumn to regular column also help reduce storage usage? Thanks.

-- Y.



