From: Yi Yang <yyang@me.com>
Subject: Re: Cassandra adding 500K + Super Column Family
Date: Tue, 16 Aug 2011 16:32:11 -0700
To: user@cassandra.apache.org

Sounds like a similar case to mine. The files are definitely extremely big; a 10x space overhead is a good estimate if you are just putting plain values into it. I'm currently testing CASSANDRA-674 in the hope that the improved SSTable format solves the space-overhead problem; please follow my e-mail thread from today, where I'll keep working on it.

If your values are integers and floats, with column names of ~4 characters, then extrapolating from my case it will cost you 1-2 TB of disk space.
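Back-of-envelope, assuming ~15 bytes of per-column serialization overhead (a commonly cited figure for pre-1.0 SSTables; that number is an assumption, not something measured in this thread):

    # Rough disk estimate for 500K entities x 6K columns each.
    rows = 500000          # entities
    cols_per_row = 6000    # values per entity
    name_bytes = 4         # ~4-character ASCII column name
    value_bytes = 8        # one integer/float value
    col_overhead = 15      # assumed flags/lengths/timestamp per column

    raw = rows * cols_per_row * (name_bytes + value_bytes + col_overhead)
    print("raw data: ~%d GB" % (raw // 10**9))                # ~81 GB

    # With the ~10x amplification I'm seeing (obsolete SSTables
    # awaiting compaction, indexes, bloom filters):
    print("with 10x overhead: ~%.1f TB" % (raw * 10 / 1e12))  # ~0.8 TB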
Best,
Steve

On Aug 16, 2011, at 4:20 PM, aaron morton wrote:

> Are you planning to create 500,000 Super Column Families or 500,000 rows in a single Super Column Family?
>
> The former is somewhat crazy. Cassandra schemas typically have up to a few tens of Column Families. Each column family involves a certain amount of memory overhead; this is now automatically managed in Cassandra 0.8 (see http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/).
>
> If I understand correctly you have 500K entities with 6K columns each. A simple first approach to modelling this would be to use a Standard CF with a row for each entity. However, the best model is the one that serves your read requests best.
>
> Also, for background, the sub-columns in a super column are not indexed; see http://wiki.apache.org/cassandra/CassandraLimitations . You would probably run into this problem if you had 6,000 sub-columns in a super column.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2011, at 12:53 AM, Renato Bacelar da Silveira wrote:
>
>> I am wondering about a certain volume situation.
>>
>> I currently load a Keyspace with a certain number of SCFs.
>>
>> Each SCF (Super Column Family) represents an entity.
>>
>> Each entity may have up to 6,000 values.
>>
>> I am planning to have 500,000 entities (SCFs) with
>> 6,000 columns (within super columns - the number of super columns
>> is unknown), and was wondering how much resource something
>> like this would require.
>>
>> I am struggling with 10,000 SCFs with 30 columns (within super columns):
>> I get very large files and reach a 4 GB heap-space limit very quickly on
>> a single node. I use garbage collection where needed.
>>
>> Is there some secret to loading 500,000 Super Column Families?
>>
>> Regards.
>> --
>> Renato da Silveira
>> Senior Developer
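P.S. To make Aaron's "one row per entity" suggestion concrete, here is a minimal sketch using the pycassa Thrift client. The keyspace and column family names ('Demo', 'Entities') and the sample columns are illustrative only, not from anyone's actual schema:

    # One standard column family for ALL entities, one row per entity,
    # instead of 500,000 column families.
    import pycassa
    from pycassa.system_manager import (SystemManager, SIMPLE_STRATEGY,
                                        UTF8_TYPE)

    sys_mgr = SystemManager('localhost:9160')
    sys_mgr.create_keyspace('Demo', SIMPLE_STRATEGY,
                            {'replication_factor': '1'})
    sys_mgr.create_column_family('Demo', 'Entities',
                                 comparator_type=UTF8_TYPE)
    sys_mgr.close()

    pool = pycassa.ConnectionPool('Demo', ['localhost:9160'])
    entities = pycassa.ColumnFamily(pool, 'Entities')

    # Each entity is one row key; its up-to-6000 values are columns.
    entities.insert('entity_00001', {'mass': '3', 'temp': '21.5'})
    print(entities.get('entity_00001', column_count=10))
    pool.dispose()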
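P.P.S. On Renato's 4 GB heap problem: Aaron's point about per-CF memory overhead is the likely culprit. Before 0.8, every column family keeps its own memtable with its own flush threshold, so even 10,000 CFs can dwarf the heap. The 64 MB figure below is an assumption (check memtable_throughput_in_mb in your cassandra.yaml), but the shape of the problem is the same regardless:

    # Worst-case memtable space across Renato's 10,000 CFs,
    # assuming a 64 MB per-CF memtable threshold.
    cf_count = 10000
    memtable_mb = 64       # assumed; see memtable_throughput_in_mb
    heap_gb = 4

    worst_case_gb = cf_count * memtable_mb / 1024.0
    print("potential memtable space: ~%d GB vs a %d GB heap"
          % (worst_case_gb, heap_gb))   # ~625 GB vs 4 GB

    # Cassandra 0.8 replaces per-CF thresholds with a single global
    # memtable_total_space_in_mb budget, but per-CF bookkeeping remains,
    # which is why schemas normally stay in the tens of CFs.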