Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: local policy)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=thelastpickle.com; h=from
	:mime-version:content-type:subject:date:in-reply-to:to
	:references:message-id; q=dns; s=thelastpickle.com; b=gBLiakOAeD
	yyJt2TrsuT81koTjY5GDevz0sqcuZFrExhaDT+5hcpXLjjQnDs6n6rDw8fyuEHAO
	MrEYeiceULPkZ/mCPwCn7Y/zRYzvs0BpccrR7opEJcndpvbBJxz45rZ+1sN1iJtB
	yRh735IjAwTAPAFTakYJv3k2qKmkvBfiE=
From: aaron morton <aaron@thelastpickle.com>
Mime-Version: 1.0 (Apple Message framework v1084)
Content-Type: multipart/alternative; boundary=Apple-Mail-2-773858108
Subject: Re: [howto measure disk usage]
Date: Mon, 16 May 2011 10:29:32 +1200
In-Reply-To: <BANLkTimS_qQUfxzfXki==kMvu88p8+6D=w@mail.gmail.com>
To: user@cassandra.apache.org
References: <BANLkTimS_qQUfxzfXki==kMvu88p8+6D=w@mail.gmail.com>
Message-Id: <4515F3DC-0A43-441F-8883-938BE01F79B5@thelastpickle.com>


--Apple-Mail-2-773858108
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=iso-8859-1

Sub columns of a super column do serialise their timestamp; they are =
just the same as regular columns. The super column does not have a =
timestamp of its own. It does have its own tombstone marker though.=20

A super column does not take a huge amount more disk space: just the =
name, a short int, two ints and a long int.
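
As a rough sketch of that overhead: the snippet below just adds up the
fields listed above. The byte sizes are an assumption (Java primitive
sizes: 2-byte short, 4-byte int, 8-byte long), and the function names
are made up for illustration; this is not the actual serialisation code.

```python
# Rough per-super-column overhead, adding up the fields listed above.
# Assumption: Java primitive sizes (short=2, int=4, long=8 bytes).
SHORT, INT, LONG = 2, 4, 8


def super_column_overhead(name: bytes) -> int:
    """Bytes one super column adds on top of its sub columns."""
    return len(name) + SHORT + 2 * INT + LONG


def total_overhead(names) -> int:
    """Sum the overhead over an iterable of super column names."""
    return sum(super_column_overhead(n) for n in names)


print(super_column_overhead(b"sc-0001"))
```

So the cost is the name plus 18 bytes per super column, which is small
next to a 100 byte value in every sub column.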

Some things to consider:

- Were there any compacted files on disk? These are sstables that have =
a zero-length file with COMPACTED in the name. These files will be =
deleted at some point.=20
- What did the commit log directory look like? Flushing should have =
checkpointed all the log segments and deleted the log files.=20
- I'm assuming this was a single node; if not, was the node collecting =
hinted handoffs?=20
- Did the standard CF have cache saving enabled?

Take a poke around the /var/lib/cassandra tree and let us know if you =
see anything interesting.=20
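
If it helps, here is a minimal sketch of that poke around. It sums file
sizes per top-level directory and lists any files with "compacted" in
the name. The function names are hypothetical, and the case-insensitive
name match is an assumption about how the marker files are named. Note
it adds up apparent file sizes, so the totals can differ a little from
what du reports.

```python
import os


def dir_size(path):
    """Total size in bytes of the regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp) and not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total


def survey(base="/var/lib/cassandra"):
    """Return ({subdir: bytes}, [paths of compacted marker files])."""
    sizes, markers = {}, []
    if not os.path.isdir(base):
        return sizes, markers
    for entry in sorted(os.listdir(base)):
        sub = os.path.join(base, entry)
        if os.path.isdir(sub):
            sizes[entry] = dir_size(sub)
    for root, _dirs, files in os.walk(base):
        for name in files:
            # Assumption: marker files carry "compacted" in their name.
            if "compacted" in name.lower():
                markers.append(os.path.join(root, name))
    return sizes, sorted(markers)


if __name__ == "__main__":
    sizes, markers = survey()
    for name, size in sizes.items():
        print(f"{name:15s} {size / 2**20:10.1f} MB")
    print(f"{len(markers)} compacted marker file(s)")
```

Run it once after the simple column test and once after the super
column test, then compare the data, commitlog and saved_caches numbers.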

Cheers
 =20
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 14 May 2011, at 03:15, Alexis Rodr=EDguez wrote:

> cassandra-people,
>=20
> I'm trying to measure disk usage by cassandra after inserting some =
columns in order to plan disk sizes and configurations for future =
deploys.=20
>=20
> My approach is very straightforward:
>=20
> clean_data (stop_cassandra && rm -rf =
/var/lib/cassandra/{data,commitlog,saved_caches}/*)
> perform_inserts
> measure_disk_usage (nodetool flush && du -ch /var/lib/cassandra)
>=20
> There are two types of inserts:
> - In a simple column, with a key, a name and a value that is a random =
string of size 100
> - In a super-column, with a key, a super-column name, a name and a =
value that is a random string of size 100
> But surprisingly, inserting 100 million columns as simple columns =
uses more disk than inserting the same number into a super-column. How =
can that be possible?
>=20
> The simple columns take 41984 MB and the super-columns 29696 MB; the =
difference is more than noticeable!
>=20
> Somebody told me yesterday that super-columns don't have a per-column =
timestamp, but in my case, even if all the data were under the same =
super-column key, that would not explain the difference!
>=20
>=20
> ps: sorry, English is not my first language
>=20
>=20
> <results.eps>


--Apple-Mail-2-773858108--