cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: column bloat
Date Tue, 10 May 2011 23:06:27 GMT
> For a reasonably large number of use cases (for me, 2 out of 3 at the moment) supercolumns
> will be units of data where the columns (attributes) will never change by themselves or where
> the data does not change anyway (archived data).

Can you use a standard CF and pack the multiple columns into one value in your app? It sounds
like the super columns are just acting as opaque containers, and Cassandra does not need to
know these are different values. I agree this only works if there is no concurrent access to
the sub columns. I'm suggesting this with one eye on https://issues.apache.org/jira/browse/CASSANDRA-2231
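
As a rough sketch of what the packing could look like on the client side (Python purely for
illustration; the function names and the length-prefixed layout are assumptions made up for
this example, not anything Cassandra provides):

```python
import struct


def pack_attributes(attrs):
    """Pack a dict of str -> bytes into one opaque column value.

    Hypothetical layout: a 2-byte attribute count, then for each attribute
    a 2-byte name length, the UTF-8 name, a 4-byte value length, the value.
    """
    parts = [struct.pack(">H", len(attrs))]
    for name in sorted(attrs):  # sorted so the output is deterministic
        key = name.encode("utf-8")
        value = attrs[name]
        parts.append(struct.pack(">H", len(key)))
        parts.append(key)
        parts.append(struct.pack(">I", len(value)))
        parts.append(value)
    return b"".join(parts)


def unpack_attributes(blob):
    """Inverse of pack_attributes: rebuild the dict from the packed bytes."""
    (count,) = struct.unpack_from(">H", blob, 0)
    offset = 2
    attrs = {}
    for _ in range(count):
        (key_len,) = struct.unpack_from(">H", blob, offset)
        offset += 2
        name = blob[offset:offset + key_len].decode("utf-8")
        offset += key_len
        (value_len,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        attrs[name] = blob[offset:offset + value_len]
        offset += value_len
    return attrs
```

The packed bytes would be written as the value of a single standard column, so only one
column name and one timestamp are stored for the whole unit. The trade-off is the one noted
above: a writer has to replace the whole value, so concurrent updates to individual
attributes are not safe.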


> It would seem like a good optimization to allow a timestamp on the supercolumn instead
> and remove the one on columns?
> 
> I believe this may also work as an optimization on compactions? Just skip merging of
> columns under the supercolumn if the supercolumn has a timestamp and just replace the entire
> supercolumn in that case.
> 
> Could be just a variation of the supercolumn object on insert. With no timestamp, use the
> ones in the columns; with a timestamp, ignore the timestamps in the columns.

SCs are more containers than columns. When it comes to reconciling their contents they act
like column families: they ask the columns to reconcile, respecting the container's tombstone.
Giving the SC a timestamp and making it act like a column would be a major change.
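
To make the container behaviour concrete, here is a deliberately simplified model in Python.
It is not Cassandra's code: the `Container` class, the `deleted_at` field and the
`reconcile` function are names invented for this sketch, and the handling of equal
timestamps is an arbitrary assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Toy stand-in for a super column or column family (hypothetical)."""
    deleted_at: int = -1  # tombstone timestamp; -1 means never deleted
    columns: dict = field(default_factory=dict)  # name -> (value, timestamp)


def reconcile(left, right):
    """Merge two versions of the same container.

    The container carries no write timestamp of its own. Each column is
    reconciled individually by timestamp, and any column written at or
    before the container's tombstone is dropped.
    """
    deleted_at = max(left.deleted_at, right.deleted_at)
    merged = {}
    for source in (left, right):
        for name, (value, ts) in source.columns.items():
            if ts <= deleted_at:
                continue  # shadowed by the container's tombstone
            # On equal timestamps this sketch keeps the first one seen.
            if name not in merged or ts > merged[name][1]:
                merged[name] = (value, ts)
    return Container(deleted_at, merged)
```

In this model the merge always has to visit every column, which is why a "replace the whole
SC if it has a timestamp" shortcut would mean changing what an SC is rather than adding a
small optimization.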

 A
   
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 11 May 2011, at 03:30, Terje Marthinussen wrote:

> 
> Anyway, to sum that up, expiring columns are 1 byte more and
> non-expiring ones are 7 bytes
> less. Not arguing, it's still fairly verbose, especially with tons of
> very small columns.
> 
> Yes, you are right, sorry.
> I was trying to do one thing too many at the same time.
> My brain filtered out part of the "else if".
>  
> 
> > - inherit timestamps from the supercolumn
> 
> Columns inside a supercolumn have no reason to share the same timestamp (or
> even close ones for that matter). But maybe you're talking about something more
> subtle, in which case yes, there are ways to compress the data.
> 
> For a reasonably large number of use cases (for me, 2 out of 3 at the moment) supercolumns
> will be units of data where the columns (attributes) will never change by themselves or where
> the data does not change anyway (archived data).
> 
> It would seem like a good optimization to allow a timestamp on the supercolumn instead
> and remove the one on columns?
> 
> I believe this may also work as an optimization on compactions? Just skip merging of
> columns under the supercolumn if the supercolumn has a timestamp and just replace the entire
> supercolumn in that case.
> 
> Could be just a variation of the supercolumn object on insert. With no timestamp, use the
> ones in the columns; with a timestamp, ignore the timestamps in the columns.
> 
> If that sounds like a sensible idea, I may be tempted to try to get time to implement
> it.
> 
> I am also tempted to do some other things like make some of the "ints" and "shorts" variable
> length as well.
> 
> Terje

