cassandra-user mailing list archives

From Aaron Morton <>
Subject Re: Super CF or two CFs?
Date Tue, 18 Jan 2011 10:51:25 GMT
Sorry, I was not suggesting in the first paragraph that a super CF is better; I think the point applies to any CF.

The role of compaction is (among other things) to reduce the number of SSTables for each CF.
The logical endpoint of this process would be a single file for each CF, giving the lowest
possible IO. The volatility of your data (overwrites and new columns for a row) works against
this process, so in reality it will not reach that end state. Even in the best case I think it
will only go down to 3 sstables. See
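To make the merging behaviour concrete, here is a toy sketch (plain Python, not Cassandra code) of what compaction does: merge several SSTables into one, keeping the newest timestamp per column (last-write-wins). The row key and column names are made up for illustration.

```python
# Toy model of SSTable compaction: each "sstable" maps row key ->
# {column: (value, timestamp)}. Compaction merges the tables, keeping
# the newest timestamp per column, yielding a single table.
def compact(sstables):
    merged = {}
    for table in sstables:
        for row_key, columns in table.items():
            row = merged.setdefault(row_key, {})
            for col, (value, ts) in columns.items():
                if col not in row or ts > row[col][1]:
                    row[col] = (value, ts)
    return merged

# Three small sstables for one row; later writes overwrite older ones.
sstables = [
    {"AAPL": {"price": (100, 1), "label": ("tech", 1)}},
    {"AAPL": {"price": (101, 2)}},
    {"AAPL": {"price": (103, 3), "volume": (9000, 3)}},
]
result = compact(sstables)
# One table remains; 'price' keeps its newest value, (103, 3).
```

The point of the sketch: each overwrite of a volatile column leaves stale copies in older sstables, which is exactly the work compaction has to keep redoing.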

If you do have some data that is highly volatile, and you have performance problems, then
changing the compaction thresholds is a recommended approach, I think. See the comments in cassandra.yaml.
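As an illustration only (syntax from the 0.7-era cassandra-cli; the CF name `Quotes` is made up), per-CF thresholds can be adjusted along these lines:

```
update column family Quotes with min_compaction_threshold = 4 and max_compaction_threshold = 32;
```

Check the comments in cassandra.yaml and the CLI help for the exact options available in your version.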

My argument is to keep data in one CF if you want to read it together. As always, store
the data to serve the read requests. Do some tests and see where your bottlenecks may be
for your HW and usage. I may be wrong.

IMHO, in this discussion Super or Standard CF will make little performance difference, other
than the super CF limitations mentioned.


On 18/01/2011, at 11:14 PM, Steven Mac <> wrote:

> Thanks for the answer. It provides me the insight I'm looking for.
> However, I'm also a bit confused, as your first paragraph seems to indicate that using
> a SCF is better, whereas the last sentence states just the opposite. Do I interpret correctly
> that this is because the compactions put all non-volatile data together in one sstable,
> leading to a compact sstable if the non-volatile data is put into a separate CF? Can this then
> be generalised into a rule of thumb to separate non-volatile data from volatile data into
> separate CFs, or am I going too far then?
> I will definitely be trying out both suggestions and post my findings.
> Hugo.
> Subject: Re: Super CF or two CFs?
> From:
> Date: Tue, 18 Jan 2011 21:54:25 +1300
> To:
> With regard to overwrites, and assuming you always want to get all the data for a stock
> ticker: any read on the volatile data will potentially touch many sstables. This IO is unavoidable
> to read this data, so we may as well read as many cols as possible at that time. Whereas if
> you split the data into two CFs you would incur all the IO for the volatile data plus IO
> for the non-volatile, and have to make two calls. (Or use different keys and make a multiget_slice
> call; the IO argument still stands.)
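The round-trip part of that argument can be sketched with a toy model (an in-memory stand-in for the column families, counting read calls; this is an illustration only, not Cassandra client code, and the ticker and column names are made up):

```python
# Toy comparison of read round trips: one CF holding volatile and
# non-volatile columns together vs. the same data split across two CFs.

class FakeCF:
    """Stands in for a column family; counts read calls made against it."""
    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.rows.get(key, {})

# One combined CF: a single call returns everything for the ticker.
combined = FakeCF({"AAPL": {"price": 103, "volume": 9000, "label": "tech"}})
all_cols = combined.get("AAPL")

# Two CFs: the same data now costs two calls (or one multiget_slice,
# but the underlying IO is still done against both column families).
volatile = FakeCF({"AAPL": {"price": 103, "volume": 9000}})
stable = FakeCF({"AAPL": {"label": "tech"}})
both = {**volatile.get("AAPL"), **stable.get("AAPL")}
```

The combined CF serves the whole read in one call, while the split layout pays for two, even though both end with the same columns in hand.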
> Thanks to compaction, less volatile data, say cols that are written once a day, week or
> month, will tend to accrete into fewer sstables. To that end it may make sense to schedule
> compactions to run after weekly bulk operations. Also take a look at the per-CF compaction
> thresholds.
> I'd recommend trying one standard CF (with the quotes packed as suggested) to start with,
> run some tests, and let us know how you go. There are some small penalties to using super CFs;
> see the limitations page on the wiki.
> Hope that helps.
> Aaron
> On 18/01/2011, at 9:29 PM, Steven Mac <> wrote:
> Some of the fields are indeed written in one shot, but others (such as label and categories)
> are added later, so I think the question still stands.
> Hugo.
> From:
> Date: Mon, 17 Jan 2011 18:47:28 -0600
> Subject: Re: Super CF or two CFs?
> To:
> On Mon, Jan 17, 2011 at 5:12 PM, Steven Mac <> wrote:
> I guess I was maybe trying to simplify the question too much. In reality I do not have
> one volatile part, but multiple ones (say all trading data of a day). Each would be a supercolumn
> identified by the time slot, with the individual fields as subcolumns.
> If you're always going to write these attributes in one shot, then just serialize them
> and use a simple CF; there's no need for a SCF.
> -Brandon
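Brandon's "serialize them and use a simple CF" suggestion could look like this sketch (json is an arbitrary choice of format, and the field names are made up for illustration):

```python
import json

# Pack all the fields written in one shot into a single column value,
# so the quote is one column write and one column read in a standard CF,
# instead of one subcolumn per field under a supercolumn.
quote = {"price": 103.25, "volume": 9000, "currency": "USD"}

packed = json.dumps(quote, sort_keys=True)   # the value stored in one column
unpacked = json.loads(packed)                # what a reader gets back
```

Any serialization format with a stable round trip works here; the point is that the whole set of one-shot fields travels as a single value.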
