incubator-cassandra-user mailing list archives

From Terje Marthinussen <>
Subject Re: column bloat
Date Wed, 11 May 2011 08:54:44 GMT
On Wed, May 11, 2011 at 8:06 AM, aaron morton wrote:

>> For a reasonably large number of use cases (for me, 2 out of 3 at the
>> moment) supercolumns will be units of data where the columns (attributes)
>> will never change by themselves or where the data does not change anyway
>> (archived data).
>
> Can you use a standard CF and pack the multiple columns into one value in
> your app? It sounds like the super columns are just acting as opaque
> containers, and Cassandra does not need to know these are different values.
> Agree this only works if there is no concurrent access on the sub columns.
> I'm suggesting this with one eye on
I have a great interest in sharing data across applications using Cassandra.
This means I also have a great interest in removing serialization from the
applications :)
Being able to get reasonably far without serialization logic in the
application is one of the main reasons I am working on Cassandra.

Yes, I have had this discussion before, so I know the next suggestion would
be to build an API on top that does the serialization, but that will further
complicate things if I want to integrate with Hadoop or other similar tools,
so why should I if I don't have to? :)
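
To make the trade-off concrete, here is a minimal sketch of the "pack the
columns into one value" approach from the quoted suggestion. It is
illustrative only: the helper names, the choice of JSON and the sample
attribute names are assumptions, not anything Cassandra provides.

```python
import json

def pack_attributes(attributes):
    """Serialize a dict of attributes into one opaque column value (bytes).

    Hypothetical helper: JSON is an arbitrary choice of format here.
    """
    return json.dumps(attributes, sort_keys=True).encode("utf-8")

def unpack_attributes(value):
    """Reverse of pack_attributes; every reader needs this same logic."""
    return json.loads(value.decode("utf-8"))

# A standard CF row modelled as a plain dict: one column per unit of data,
# with all attributes hidden inside a single opaque value.
row = {"unit-0001": pack_attributes({"host": "web1", "status": "200"})}

# Any other consumer (another application, a Hadoop job) has to know the
# format before it can read a single attribute.
attributes = unpack_attributes(row["unit-0001"])
```

The store only ever sees one blob per unit, which is exactly why each
application sharing the data ends up carrying the serialization logic.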

>> It would seem like a good optimization to allow a timestamp on the
>> supercolumn instead and remove the one on columns?
>> I believe this may also work as an optimization on compactions? Just skip
>> merging of columns under the supercolumn if the supercolumn has a timestamp
>> and just replace the entire supercolumn in that case.
>> Could be just a variation of the supercolumn object on insert. No
>> timestamp, use the one in the columns, include timestamp, ignore timestamps
>> in columns.
>
> SCs are more containers than columns; when it comes to reconciling their
> contents they act like column families: ask the columns to reconcile,
> respecting the container's tombstone. Giving the SC a timestamp and making
> them act like columns would be a major change.

Not so sure it would be a major change, but if we can assume that people
(or APIs) will be smart enough to feed data where all columns have the same
timestamp when they want to save some disk, I guess this can be compressed
quite efficiently anyway.
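
The two behaviours discussed above can be sketched roughly as follows. This
is a simplified model, not Cassandra's actual code: the function names and
data layout are hypothetical, and tombstones and timestamp ties are ignored.
The last helper shows why identical timestamps should compress well: after
delta encoding they become a run of zeros.

```python
def reconcile_columns(old, new):
    """Per-column merge: each column is (value, timestamp), newest wins."""
    merged = dict(old)
    for name, (value, ts) in new.items():
        if name not in merged or ts > merged[name][1]:
            merged[name] = (value, ts)
    return merged

def reconcile_supercolumn(old, new):
    """Hypothetical variant: a supercolumn is (timestamp, {name: value}).

    The newer supercolumn replaces the older one as a whole, so the
    columns underneath are never merged.
    """
    return new if new[0] > old[0] else old

def timestamp_deltas(timestamps):
    """Delta-encode a list of timestamps (first value, then differences)."""
    if not timestamps:
        return []
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

# Columns written together share one timestamp, so the deltas are all zero.
deltas = timestamp_deltas([1305104084, 1305104084, 1305104084])
```

With per-column reconciliation every column has to be compared; with a
timestamp on the supercolumn a single comparison decides the outcome.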

