cassandra-user mailing list archives

From Yiming Sun <yiming....@gmail.com>
Subject Re: data size difference between supercolumn and regular column
Date Wed, 04 Apr 2012 13:19:24 GMT
Cool, I will look into this new leveled compaction strategy and give it a
try.

BTW, Aaron, I think the last word of your message was meant to be
"compression", correct?

-- Y.

On Mon, Apr 2, 2012 at 9:37 PM, aaron morton <aaron@thelastpickle.com> wrote:

> If you have a workload with overwrites you will end up with some data
> needing compaction. Running a nightly manual compaction would remove this,
> but it will also soak up some IO so it may not be the best solution.
>
> I do not know if Leveled compaction would result in a smaller disk load
> for the same workload.
>
> I agree with other people, turn on compaction.
>
> Cheers
>
>   -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 3/04/2012, at 9:19 AM, Yiming Sun wrote:
>
> Yup Jeremiah, I learned a hard lesson on how cassandra behaves when it
> runs out of disk space :-S.  I didn't try the compression, but when it
> ran out of disk space, or came close to running out, compaction would fail
> because it needs space to create some tmp data files.
>
> I shall get a tattoo that says keep it around 50% -- this is a valuable tip.
>
> -- Y.
>
> On Sun, Apr 1, 2012 at 11:25 PM, Jeremiah Jordan <
> JEREMIAH.JORDAN@morningstar.com> wrote:
>
>>  Is that 80% with compression?  If not, the first thing to do is turn on
>> compression.  Cassandra doesn't behave well when it runs out of disk space.
>>  You really want to try and stay around 50%,  60-70% works, but only if it
>> is spread across multiple column families, and even then you can run into
>> issues when doing repairs.
>>
>>  -Jeremiah
>>
>>
>>
>>  On Apr 1, 2012, at 9:44 PM, Yiming Sun wrote:
>>
>> Thanks Aaron.  Well I guess it is possible the data files from
>> supercolumns could've been reduced in size after compaction.
>>
>>  This brings yet another question.  Say I am on a shoestring budget and
>> can only put together a cluster with very limited storage space.  The first
>> iteration of pushing data into cassandra would drive the disk usage up into
>> the 80% range.  As time goes by, there will be updates to the data, and
>> many columns will be overwritten.  If I just push the updates in, the disks
>> will run out of space on all of the cluster nodes.  What would be the best
>> way to handle such a situation if I cannot buy larger disks? Do I need
>> to delete the rows/columns that are going to be updated, do a compaction,
>> and then insert the updates?  Or is there a better way?  Thanks
>>
>>  -- Y.
>>
>> On Sat, Mar 31, 2012 at 3:28 AM, aaron morton <aaron@thelastpickle.com> wrote:
>>
>>>   does cassandra 1.0 perform some default compression?
>>>
>>>  No.
>>>
>>>  The on disk size depends to some degree on the work load.
>>>
>>>  If there are a lot of overwrites or deletes you may have rows/columns
>>> that need to be compacted. You may have some big old SSTables that have not
>>> been compacted for a while.
>>>
>>>  There is some overhead involved in the super columns: the super col
>>> name, length of the name and the number of columns.
>>>
>>>  Cheers
>>>
>>>     -----------------
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>>  On 29/03/2012, at 9:47 AM, Yiming Sun wrote:
>>>
>>> Actually, after I read an article on cassandra 1.0 compression just now
>>> (
>>> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression),
>>> I am more puzzled.  In our schema, we didn't specify any compression
>>> options -- does cassandra 1.0 perform some default compression? or is the
>>> data reduction purely because of the schema change?  Thanks.
>>>
>>>  -- Y.
>>>
>>> On Wed, Mar 28, 2012 at 4:40 PM, Yiming Sun <yiming.sun@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>>  We are trying to estimate the amount of storage we need for a
>>>> production cassandra cluster.  While I was doing the calculation, I noticed
>>>> a very dramatic difference in terms of storage space used by cassandra data
>>>> files.
>>>>
>>>>  Our previous setup consists of a single-node cassandra 0.8.x with no
>>>> replication, and the data is stored using supercolumns, and the data files
>>>> total about 534GB on disk.
>>>>
>>>>  A few weeks ago, I put together a cluster consisting of 3 nodes
>>>> running cassandra 1.0 with replication factor of 2, and the data is
>>>> flattened out and stored using regular columns.  And the aggregated data
>>>> file size is only 488GB (would be 244GB if no replication).
>>>>
>>>>  This is a very dramatic reduction in terms of storage needs, and is
>>>> certainly good news in terms of how much storage we need to provision.
>>>>  However, because of the dramatic reduction, I also would like to make sure
>>>> it is absolutely correct before submitting it - and also get a sense of why
>>>> there was such a difference. -- I know cassandra 1.0 does data compression,
>>>> but does the schema change from supercolumn to regular column also help
>>>> reduce storage usage?  Thanks.
>>>>
>>>>  -- Y.
>>>>
>>>
>>>
>>>
>>
>>
>
>
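
The sizing arithmetic discussed in this thread (488GB across the cluster at replication factor 2, i.e. 244GB for a single copy, and the advice to keep disks around 50% full) can be sketched as a few helper functions. This is only a rough illustration: the function names are made up for this example, the even spread of data across the 3 nodes is an assumption, and real on-disk usage also depends on pending compaction and on whether compression is enabled.

```python
"""Back-of-the-envelope storage sizing using figures quoted in this thread."""


def unreplicated_size_gb(total_on_disk_gb: float, replication_factor: int) -> float:
    """Size of a single copy of the data, given the cluster-wide total."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be >= 1")
    return total_on_disk_gb / replication_factor


def per_node_load_gb(total_on_disk_gb: float, node_count: int) -> float:
    """Average load per node. ASSUMPTION: data is spread evenly across nodes."""
    if node_count < 1:
        raise ValueError("node_count must be >= 1")
    return total_on_disk_gb / node_count


def min_disk_capacity_gb(load_gb: float, target_utilization: float = 0.5) -> float:
    """Disk size needed so that load_gb fills only target_utilization of it.

    The 0.5 default mirrors the "stay around 50%" rule of thumb from the
    thread; it leaves room for the temporary files that compaction writes.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return load_gb / target_utilization


if __name__ == "__main__":
    total_gb = 488.0  # aggregated data file size reported for the new cluster
    single_copy = unreplicated_size_gb(total_gb, replication_factor=2)
    node_load = per_node_load_gb(total_gb, node_count=3)
    print(f"single copy of the data: {single_copy:.0f} GB")
    print(f"average load per node:   {node_load:.1f} GB")
    print(f"disk per node at 50%:    {min_disk_capacity_gb(node_load):.1f} GB")
```

Changing `target_utilization` to 0.8 shows how little headroom the "80% range" scenario from the thread would leave for compaction and repairs.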
