From: Yi Yang <yyang@me.com>
Subject: Re: Cassandra adding 500K + Super Column Family
Date: Tue, 16 Aug 2011 16:32:11 -0700
To: user@cassandra.apache.org

Sounds like a similar case to mine. The files are definitely extremely big; a 10x space overhead is a good estimate if you are just putting plain values into it. I'm currently testing CASSANDRA-674 in the hope that the improved SSTable format solves the space-overhead problem; please follow my e-mail thread from today, where I'll keep working on it.

If your values are integers and floats, with column names of ~4 characters, then extrapolating from my case it will cost you 1-2 TB of disk space.
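Back-of-envelope, assuming ~15 bytes of per-column serialization overhead (a commonly cited figure for pre-1.0 SSTables; that number is an assumption, not something measured in this thread):

    # Rough disk estimate for 500K entities x 6K columns each.
    rows = 500000          # entities
    cols_per_row = 6000    # values per entity
    name_bytes = 4         # ~4-character ASCII column name
    value_bytes = 8        # one integer/float value
    col_overhead = 15      # assumed flags/lengths/timestamp per column

    raw = rows * cols_per_row * (name_bytes + value_bytes + col_overhead)
    print("raw data: ~%d GB" % (raw // 10**9))                # ~81 GB

    # With the ~10x amplification I'm seeing (obsolete SSTables
    # awaiting compaction, indexes, bloom filters):
    print("with 10x overhead: ~%.1f TB" % (raw * 10 / 1e12))  # ~0.8 TB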
Best,
Steve

On Aug 16, 2011, at 4:20 PM, aaron morton wrote:

> Are you planning to create 500,000 Super Column Families or 500,000 rows in a single Super Column Family?
>
> The former is somewhat crazy. Cassandra schemas typically have up to a few tens of Column Families. Each column family involves a certain amount of memory overhead; this is now automatically managed in Cassandra 0.8 (see http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/).
>
> If I understand correctly you have 500K entities with 6K columns each. A simple first approach to modelling this would be to use a Standard CF with a row for each entity. However, the best model is the one that serves your read requests best.
>
> Also, for background, the sub-columns in a super column are not indexed; see http://wiki.apache.org/cassandra/CassandraLimitations . You would probably run into this problem if you had 6,000 sub-columns in a super column.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2011, at 12:53 AM, Renato Bacelar da Silveira wrote:
>
>> I am wondering about a certain volume situation.
>>
>> I currently load a Keyspace with a certain number of SCFs.
>>
>> Each SCF (Super Column Family) represents an entity.
>>
>> Each entity may have up to 6,000 values.
>>
>> I am planning to have 500,000 entities (SCFs) with
>> 6,000 columns (within super columns - the number of super columns
>> is unknown), and was wondering how much resource something
>> like this would require.
>>
>> I am struggling with 10,000 SCFs with 30 columns (within super columns):
>> I get very large files and reach a 4 GB heap-space limit very quickly on
>> a single node. I use garbage collection where needed.
>>
>> Is there some secret to loading 500,000 Super Column Families?
>>
>> Regards.
>> --
>> Renato da Silveira
>> Senior Developer
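P.S. To make Aaron's "one row per entity" suggestion concrete, here is a minimal sketch using the pycassa Thrift client. The keyspace and column family names ('Demo', 'Entities') and the sample columns are illustrative only, not from anyone's actual schema:

    # One standard column family for ALL entities, one row per entity,
    # instead of 500,000 column families.
    import pycassa
    from pycassa.system_manager import (SystemManager, SIMPLE_STRATEGY,
                                        UTF8_TYPE)

    sys_mgr = SystemManager('localhost:9160')
    sys_mgr.create_keyspace('Demo', SIMPLE_STRATEGY,
                            {'replication_factor': '1'})
    sys_mgr.create_column_family('Demo', 'Entities',
                                 comparator_type=UTF8_TYPE)
    sys_mgr.close()

    pool = pycassa.ConnectionPool('Demo', ['localhost:9160'])
    entities = pycassa.ColumnFamily(pool, 'Entities')

    # Each entity is one row key; its up-to-6000 values are columns.
    entities.insert('entity_00001', {'mass': '3', 'temp': '21.5'})
    print(entities.get('entity_00001', column_count=10))
    pool.dispose()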
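P.P.S. On Renato's 4 GB heap problem: Aaron's point about per-CF memory overhead is the likely culprit. Before 0.8, every column family keeps its own memtable with its own flush threshold, so even 10,000 CFs can dwarf the heap. The 64 MB figure below is an assumption (check memtable_throughput_in_mb in your cassandra.yaml), but the shape of the problem is the same regardless:

    # Worst-case memtable space across Renato's 10,000 CFs,
    # assuming a 64 MB per-CF memtable threshold.
    cf_count = 10000
    memtable_mb = 64       # assumed; see memtable_throughput_in_mb
    heap_gb = 4

    worst_case_gb = cf_count * memtable_mb / 1024.0
    print("potential memtable space: ~%d GB vs a %d GB heap"
          % (worst_case_gb, heap_gb))   # ~625 GB vs 4 GB

    # Cassandra 0.8 replaces per-CF thresholds with a single global
    # memtable_total_space_in_mb budget, but per-CF bookkeeping remains,
    # which is why schemas normally stay in the tens of CFs.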