incubator-cassandra-user mailing list archives

From Yiming Sun <yiming....@gmail.com>
Subject Re: data size difference between supercolumn and regular column
Date Mon, 02 Apr 2012 21:19:10 GMT
Yup Jeremiah, I learned a hard lesson on how cassandra behaves when it runs
out of disk space :-S.    I didn't try the compression, but when it ran out
of disk space, or near running out, compaction would fail because it needs
space to create some tmp data files.

I shall get a tattoo that says keep it around 50% -- this is a valuable tip.
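
As a rough illustration of that rule of thumb, here is a minimal Python sketch that checks how full a disk is against the 50% target. The function names and the "." path are placeholders made up for this example (point it at your own Cassandra data directory); the 0.5 threshold is just the figure suggested in this thread, not a Cassandra setting.

```python
import shutil


def disk_used_fraction(path):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def has_compaction_headroom(used_fraction, max_used_fraction=0.5):
    """True if usage is at or below the target (50% by default, per this thread)."""
    return used_fraction <= max_used_fraction


if __name__ == "__main__":
    data_dir = "."  # hypothetical: replace with the Cassandra data directory
    used = disk_used_fraction(data_dir)
    status = "ok" if has_compaction_headroom(used) else "over the 50% target"
    print(f"{data_dir}: {used:.0%} used ({status})")
```

A check like this could run from cron on each node, so the warning comes before compaction starts failing for lack of space to write its temporary files.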

-- Y.

On Sun, Apr 1, 2012 at 11:25 PM, Jeremiah Jordan <
JEREMIAH.JORDAN@morningstar.com> wrote:

>  Is that 80% with compression?  If not, the first thing to do is turn on
> compression.  Cassandra doesn't behave well when it runs out of disk space.
>  You really want to try and stay around 50%,  60-70% works, but only if it
> is spread across multiple column families, and even then you can run into
> issues when doing repairs.
>
>  -Jeremiah
>
>
>
>  On Apr 1, 2012, at 9:44 PM, Yiming Sun wrote:
>
> Thanks Aaron.  Well I guess it is possible the data files from
> supercolumns could've been reduced in size after compaction.
>
>  This brings up yet another question.  Say I am on a shoestring budget and
> can only put together a cluster with very limited storage space.  The first
> iteration of pushing data into cassandra would drive the disk usage up into
> the 80% range.  As time goes by, there will be updates to the data, and
> many columns will be overwritten.  If I just push the updates in, the disks
> will run out of space on all of the cluster nodes.  What would be the best
> way to handle such a situation if I cannot buy larger disks?  Do I need
> to delete the rows/columns that are going to be updated, do a compaction,
> and then insert the updates?  Or is there a better way?  Thanks
>
>  -- Y.
>
> On Sat, Mar 31, 2012 at 3:28 AM, aaron morton <aaron@thelastpickle.com> wrote:
>
>>   does cassandra 1.0 perform some default compression?
>>
>>  No.
>>
>>  The on disk size depends to some degree on the work load.
>>
>>  If there are a lot of overwrites or deletes, you may have rows/columns
>> that need to be compacted. You may have some big old SSTables that have not
>> been compacted for a while.
>>
>>  There is some overhead involved in the super columns: the super col
>> name, length of the name and the number of columns.
>>
>>  Cheers
>>
>>     -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>>  On 29/03/2012, at 9:47 AM, Yiming Sun wrote:
>>
>> Actually, after I read an article on cassandra 1.0 compression just now (
>> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression),
>> I am more puzzled.  In our schema, we didn't specify any compression
>> options -- does cassandra 1.0 perform some default compression? or is the
>> data reduction purely because of the schema change?  Thanks.
>>
>>  -- Y.
>>
>> On Wed, Mar 28, 2012 at 4:40 PM, Yiming Sun <yiming.sun@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>  We are trying to estimate the amount of storage we need for a
>>> production cassandra cluster.  While I was doing the calculation, I noticed
>>> a very dramatic difference in terms of storage space used by cassandra data
>>> files.
>>>
>>>  Our previous setup consists of a single-node cassandra 0.8.x with no
>>> replication, and the data is stored using supercolumns, and the data files
>>> total about 534GB on disk.
>>>
>>>  A few weeks ago, I put together a cluster consisting of 3 nodes
>>> running cassandra 1.0 with replication factor of 2, and the data is
>>> flattened out and stored using regular columns.  And the aggregated data
>>> file size is only 488GB (would be 244GB if no replication).
>>>
>>>  This is a very dramatic reduction in terms of storage needs, and is
>>> certainly good news in terms of how much storage we need to provision.
>>>  However, because of the dramatic reduction, I also would like to make sure
>>> it is absolutely correct before submitting it - and also get a sense of why
>>> there was such a difference. -- I know cassandra 1.0 does data compression,
>>> but does the schema change from supercolumn to regular column also help
>>> reduce storage usage?  Thanks.
>>>
>>>  -- Y.
>>>
>>
>>
>>
>
>
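
The figures in the original message at the bottom of this thread can be put on a like-for-like basis by dividing each on-disk total by its replication factor. The sketch below only redoes that arithmetic with the numbers quoted there (534GB with no replication, 488GB at replication factor 2); the function name is made up for illustration. As Aaron notes, the old total may include overwritten or deleted data that had not yet been compacted, so the difference should not be read as purely an effect of the schema change.

```python
def unique_data_gb(total_on_disk_gb, replication_factor):
    """Approximate size of one copy of the data, given the cluster-wide on-disk total."""
    return total_on_disk_gb / replication_factor


# Figures quoted in the original message.
old_gb = unique_data_gb(534, 1)  # single node, supercolumns, no replication
new_gb = unique_data_gb(488, 2)  # 3 nodes, regular columns, replication factor 2

reduction = 1 - new_gb / old_gb
print(f"old: {old_gb:.0f}GB, new: {new_gb:.0f}GB, reduction: {reduction:.1%}")
```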
