cassandra-user mailing list archives

From Praveen Baratam <>
Subject Re: Row or Supercolumn with approximately n columns
Date Tue, 03 Jan 2012 06:27:38 GMT
I understand that there will be contention over which *n* columns are the
current *n* columns, but as mentioned previously, the goal is to limit the
accumulation of data, since in our use case some row keys can receive
fairly heavy inserts. Clients that require a precise set of current columns
can keep a buffer of *m* columns above the *n* columns and filter on the
client side.
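
For illustration, a minimal sketch of that client-side filter (the class
and method names are hypothetical; any slice call that returns columns in
comparator order would feed it):

    import java.util.List;

    public final class ClientSideTrim {

        /** Keep only the first n columns of an over-fetched (n + m) slice. */
        public static <C> List<C> firstN(List<C> overFetchedSlice, int n) {
            // The server keeps roughly n + m columns, so truncating the
            // comparator-ordered slice yields a precise first-n set.
            return overFetchedSlice.subList(0, Math.min(n, overFetchedSlice.size()));
        }
    }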

I believe this approach will not tax Cassandra in terms of performance.

Coming to TTL-based columns, it's difficult to store the last *n* samples
with this approach. If inserts happen at a constant or predictable rate,
then we can achieve the desired functionality using TTL, but if inserts
are event-driven, there is no way to see the last *n* samples after the
TTL expires. This may not be desirable in many use cases, including mine.
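
As a back-of-the-envelope sketch of that limitation (the class is
hypothetical; it assumes a target of the last *n* samples): at a constant
rate the TTL can be derived from the rate, but an event-driven stream has
no rate to derive it from.

    public final class TtlSizing {

        /** TTL (seconds) retaining roughly the last n columns at a constant insert rate. */
        public static int ttlForLastN(int n, double insertsPerSecond) {
            if (insertsPerSecond <= 0) {
                // Event-driven traffic has no steady rate, so no TTL can
                // guarantee the last n samples survive a quiet period.
                throw new IllegalArgumentException("rate must be known and > 0");
            }
            return (int) Math.ceil(n / insertsPerSecond);
        }

        public static void main(String[] args) {
            // 100 samples at 2 inserts/second -> a TTL of 50 seconds keeps ~100 columns.
            System.out.println(ttlForLastN(100, 2.0));
        }
    }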

Another approach could be a cron job that reads all the rows and slices
every row down to its first *n* columns using batch_mutate. For this to be
efficient, we need an efficient way to query for rows with more than *n*
columns. This would amount to a quick, externally managed compaction, if
the performance penalty can be minimized by some internal API provisions.
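
A minimal sketch of such a trim job against the 1.0-era Thrift API
(connection setup, row-range paging and error handling are omitted; the
column family name and *n* are assumptions):

    import java.nio.ByteBuffer;
    import java.util.*;

    import org.apache.cassandra.thrift.*;

    public final class TrimJob {

        static final int N = 100;            // columns to keep per row
        static final String CF = "Samples";  // hypothetical column family

        static void trimRows(Cassandra.Client client) throws Exception {
            ColumnParent parent = new ColumnParent(CF);

            // Fetch one page of rows with all their columns; a real job
            // would page through the range using the last key seen.
            SlicePredicate all = new SlicePredicate();
            all.setSlice_range(new SliceRange(ByteBuffer.allocate(0),
                    ByteBuffer.allocate(0), false, Integer.MAX_VALUE));
            KeyRange range = new KeyRange(1000);
            range.setStart_key(new byte[0]);
            range.setEnd_key(new byte[0]);

            for (KeySlice row : client.get_range_slices(parent, all, range,
                    ConsistencyLevel.QUORUM)) {
                List<ColumnOrSuperColumn> cols = row.getColumns();
                if (cols.size() <= N) continue;

                // Older Thrift releases do not support slice-range deletes
                // in batch_mutate, so delete the surplus columns by name.
                List<ByteBuffer> stale = new ArrayList<ByteBuffer>();
                for (ColumnOrSuperColumn c : cols.subList(N, cols.size()))
                    stale.add(ByteBuffer.wrap(c.getColumn().getName()));

                SlicePredicate names = new SlicePredicate();
                names.setColumn_names(stale);
                Deletion deletion = new Deletion();
                deletion.setTimestamp(System.currentTimeMillis() * 1000L); // microseconds
                deletion.setPredicate(names);
                Mutation m = new Mutation();
                m.setDeletion(deletion);

                Map<String, List<Mutation>> perCf = new HashMap<String, List<Mutation>>();
                perCf.put(CF, Collections.singletonList(m));
                Map<ByteBuffer, Map<String, List<Mutation>>> batch =
                        new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
                batch.put(ByteBuffer.wrap(row.getKey()), perCf);
                client.batch_mutate(batch, ConsistencyLevel.QUORUM);
            }
        }
    }

Note this reads whole rows just to find the surplus columns; that is
exactly the inefficiency an internal API hook (e.g. a cheap per-row column
count) would remove.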

I have also opened the above ticket to collect ideas for solving this
problem. Sadly, no activity yet.

Coming to custom compaction for this purpose, a leveled compaction with
only two levels, or even just one, could be enough, since rows are not
meant to grow huge and most rows have a similar number of similarly sized
columns.
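
For reference, Cassandra 1.0 already lets you pick the compaction strategy
per column family from cassandra-cli (the CF name here is an assumption):

    update column family Samples
      with compaction_strategy = 'LeveledCompactionStrategy';

A custom strategy enforcing the first-*n* trim would presumably plug in
through the same compaction_strategy attribute.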


On Tue, Jan 3, 2012 at 4:29 AM, aaron morton <> wrote:

> During compaction, both automatic/minor and manual/major.
> The performance drop comes from having a lot of expired columns that have
> not yet been purged by compaction, since they must still be read and
> discarded during reads.
> Cheers
>   -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> On 3/01/2012, at 10:38 AM, R. Verlangen wrote:
> @Aaron: Small side question: when do columns with a past TTL get removed?
> On a repair, a (minor) compaction, or .. ? Is there a performance drop when
> that happens?
> 2012/1/2 aaron morton <>
>> Even if you had compaction enforcing a limit on the number of columns in
>> a row, there would still be issues with concurrent writes happening at
>> the same time and with read repair, i.e. node A says this is the first n
>> columns but node B says something else; you only know who is correct at
>> read time.
>> Have you considered using a TTL on the columns ?
>> Depending on the use case, you could also consider having writes
>> periodically or randomly trim the data size, or trimming on reads.
>> It will also make sense to partition the time series data into different
>> rows, and Viva la Standard Column Families!
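>> For example (a minimal sketch; the day-sized bucket and key layout are
>> just one possibility):
>>
>>     // One row per series per day, so no single row grows without bound.
>>     public final class SeriesKeys {
>>         static final long MILLIS_PER_DAY = 86400000L;
>>         /** Row key of the form "seriesId:dayBucket". */
>>         public static String rowKey(String seriesId, long eventMillis) {
>>             return seriesId + ":" + (eventMillis / MILLIS_PER_DAY);
>>         }
>>     }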
>> Hope that helps.
>>   -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> On 25/12/2011, at 7:48 PM, Praveen Baratam wrote:
>> Hello Everybody,
>> Happy Christmas.
>> I know that this topic has come up quite a few times on the Dev and User
>> lists but has not culminated in a solution.
>> The above discussion on the User list talks about AbstractCompactionStrategy,
>> but I could not find any relevant documentation, as it's a fairly new
>> feature in Cassandra.
>> Let me state this necessity and use-case again.
>> I need a ColumnFamily (CF) wide or SuperColumn (SC) wide option to
>> approximately limit the number of columns to "n". "n" can vary a lot, and
>> the intention is to throw away stale data, not to maintain any hard limit
>> on the CF or SC. It's very useful for storing time-series data where stale
>> data is not needed. The goal is to achieve this with minimum overhead, and
>> since compaction happens all the time, it would be clever to implement it
>> as part of compaction.
>> Thanks in advance.
>> Praveen
