From: Jeremiah Jordan <JEREMIAH.JORDAN@morningstar.com>
To: user@cassandra.apache.org
Subject: Re: data size difference between supercolumn and regular column
Date: Mon, 2 Apr 2012 03:25:25 +0000

Is that 80% with compression? If not, the first thing to do is turn on compression. Cassandra doesn't behave well when it runs out of disk space. You really want to try and stay around 50%; 60-70% works, but only if it is spread across multiple column families, and even then you can run into issues when doing repairs.

-Jeremiah

On Apr 1, 2012, at 9:44 PM, Yiming Sun wrote:

Thanks Aaron. Well I guess it is possible the data files from supercolumns could've been reduced in size after compaction.

This brings up yet another question. Say I am on a shoestring budget and can only put together a cluster with very limited storage space. The first iteration of pushing data into Cassandra would drive the disk usage up into the 80% range. As time goes by, there will be updates to the data, and many columns will be overwritten. If I just push the updates in, the disks will run out of space on all of the cluster nodes. What would be the best way to handle such a situation if I cannot buy larger disks? Do I need to delete the rows/columns that are going to be updated, do a compaction, and then insert the updates? Or is there a better way? Thanks

-- Y.
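The overwrite concern above can be illustrated with a back-of-envelope sketch: until compaction runs, an overwritten column exists on disk twice (old copy plus new copy), and size-tiered compaction itself temporarily needs room for both its input and output SSTables, which is where the ~50% free-space guidance comes from. The numbers and helper names below are illustrative, not from the thread:

```python
# Back-of-envelope sketch (hypothetical numbers) of why overwrites inflate
# on-disk usage until compaction reclaims the obsolete copies.

def on_disk_size(live_gb, overwrite_fraction, compacted=False):
    """Estimate on-disk size: before compaction, overwritten columns exist
    twice (old + new copy); compaction drops the obsolete copies."""
    if compacted:
        return live_gb
    return live_gb * (1 + overwrite_fraction)

def peak_during_compaction(on_disk_gb):
    """Size-tiered compaction rewrites SSTables, so input and output can
    coexist briefly -- in the worst case roughly doubling the footprint.
    This is the reason for the ~50% free-space guidance in the thread."""
    return 2 * on_disk_gb

live = 400.0     # GB of live data (hypothetical)
updated = 0.25   # 25% of columns overwritten since the last compaction

before = on_disk_size(live, updated)              # 500.0 GB on disk
after = on_disk_size(live, updated, compacted=True)   # 400.0 GB once compacted
peak = peak_during_compaction(before)             # 1000.0 GB worst-case peak
print(before, after, peak)
```

So the answer to "delete, compact, then insert" is usually unnecessary: overwrites don't need explicit deletes, but the disk must have headroom for the transient duplication until compaction catches up.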
On Sat, Mar 31, 2012 at 3:28 AM, aaron morton <aaron@thelastpickle.com> wrote:

> does cassandra 1.0 perform some default compression?
No.

The on-disk size depends to some degree on the workload.

If there are a lot of overwrites or deletes you may have rows/columns that need to be compacted. You may have some big old SSTables that have not been compacted for a while.

There is some overhead involved in the super columns: the super column name, the length of the name, and the number of columns.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/03/2012, at 9:47 AM, Yiming Sun wrote:

Actually, after I read an article on Cassandra 1.0 compression just now (http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression), I am more puzzled. In our schema, we didn't specify any compression options -- does Cassandra 1.0 perform some default compression? Or is the data reduction purely because of the schema change? Thanks.

-- Y.

On Wed, Mar 28, 2012 at 4:40 PM, Yiming Sun <yiming.sun@gmail.com> wrote:
Hi,

We are trying to estimate the amount of storage we need for a production Cassandra cluster. While I was doing the calculation, I noticed a very dramatic difference in terms of storage space used by Cassandra data files.

Our previous setup consists of a single-node Cassandra 0.8.x with no replication, the data is stored using supercolumns, and the data files total about 534GB on disk.

A few weeks ago, I put together a cluster consisting of 3 nodes running Cassandra 1.0 with a replication factor of 2, and the data is flattened out and stored using regular columns. The aggregated data file size is only 488GB (which would be 244GB with no replication).

This is a very dramatic reduction in terms of storage needs, and is certainly good news in terms of how much storage we need to provision. However, because of the dramatic reduction, I also would like to make sure it is absolutely correct before submitting it -- and also get a sense of why there was such a difference. I know Cassandra 1.0 does data compression, but does the schema change from supercolumn to regular column also help reduce storage usage? Thanks.

-- Y.
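The sizing figures in this thread can be cross-checked with simple arithmetic: 488GB on disk at replication factor 2 implies 244GB of unique data, and Jeremiah's utilization guidance translates into raw capacity per node. A minimal sketch, assuming an even token distribution across nodes (the helper names are mine, not a Cassandra API):

```python
# Sketch of the storage arithmetic from the thread: unique data implied by a
# replicated on-disk total, per-node share, and raw disk needed to stay at a
# target utilization. Assumes data is evenly distributed across nodes.

def unique_data_gb(total_on_disk_gb, replication_factor):
    """Unique (unreplicated) data implied by the cluster-wide on-disk total."""
    return total_on_disk_gb / replication_factor

def per_node_gb(total_on_disk_gb, nodes):
    """On-disk data per node, assuming an even token distribution."""
    return total_on_disk_gb / nodes

def required_disk_gb(per_node_data_gb, target_utilization):
    """Raw disk each node needs so usage stays at the target fraction."""
    return per_node_data_gb / target_utilization

total = 488.0   # GB on disk across the cluster (figure from the thread)
rf = 2
nodes = 3

print(unique_data_gb(total, rf))    # 244.0 -- matches the thread's number
print(per_node_gb(total, nodes))    # ~162.7 GB of data per node
print(required_disk_gb(per_node_gb(total, nodes), 0.5))  # ~325.3 GB raw at 50%
```

At Jeremiah's conservative 50% target, each of the 3 nodes would want roughly 325GB of raw disk even before accounting for growth from future overwrites.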