A couple of ideas. One is to multiplex the event log stream (using Flume or Kafka) and feed it straight into your secondary system. The event system should allow you to rate-limit inserts if that is a concern.

The other is to use partitioning.

Group the log entries per user into some sensible partition, e.g. per day or per week. So your row key is "user_id : partition_start". 
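A minimal sketch of that row-key scheme in Python (the bucket width, date format, and function names are assumptions; buckets here align to the Unix epoch rather than calendar weeks):

```python
from datetime import datetime, timezone

def partition_start(ts: datetime, days: int = 7) -> str:
    """Truncate a timestamp to the start of its partition window.

    Windows are aligned to the Unix epoch, so a 7-day window is not
    an ISO calendar week -- an assumption for illustration only.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    window = days * 86400
    bucket = int((ts - epoch).total_seconds()) // window * window
    return datetime.fromtimestamp(bucket, tz=timezone.utc).strftime("%Y%m%d")

def row_key(user_id: str, ts: datetime) -> str:
    # "user_id : partition_start" as described above
    return f"{user_id}:{partition_start(ts)}"
```

All log entries for a user that fall in the same window then land in the same row, which keeps any single row from growing without bound.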

You can then keep a record of dirty partitions, though this can be tricky depending on scale. It could be a row for each user, with a column for each dirty partition. Loading the delta then requires a range scan over the dirty partitions CF to read all rows, and then reading each dirty partition for the user. You would want to look at a low gc_grace_seconds and levelled compaction for the dirty partitions CF.
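A rough in-memory sketch of that dirty-partition bookkeeping (the dict-based stand-ins for the two CFs, and all function names, are assumptions for illustration, not Cassandra API calls):

```python
from collections import defaultdict

# Stand-ins for the two CFs described above: one row per user in the
# "dirty" CF, one column per dirty partition.
dirty = defaultdict(set)       # user_id -> {partition_start, ...}
log_cf = defaultdict(list)     # "user_id:partition" -> [log entries]

def write_entry(user_id, partition, entry):
    """Append a log entry and mark its partition dirty."""
    log_cf[f"{user_id}:{partition}"].append(entry)
    dirty[user_id].add(partition)

def load_delta():
    """Scan the dirty CF, read each dirty partition, then clear the markers.

    The clear step is a delete in Cassandra, which is why a low
    gc_grace_seconds matters for this CF.
    """
    delta = {}
    for user_id, partitions in list(dirty.items()):
        for p in sorted(partitions):
            delta[f"{user_id}:{p}"] = log_cf[f"{user_id}:{p}"]
        dirty[user_id].clear()
    return delta
```

The key property is that the delta load only touches partitions that actually changed, rather than re-reading every user's full history.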

Hope that helps.  

Aaron Morton
Freelance Cassandra Developer
New Zealand


On 12/12/2012, at 7:20 AM, "Hiller, Dean" <> wrote:

Wide rows do not work well once you get past 10,000,000 columns though, so be very, very careful there. PlayOrm does some wide row indices for us, and each index row is as long as the number of rows in the partition. Without PlayOrm you could do the partitioning yourself, by the way: it's as simple as storing every row and adding it to the partitions index.
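For illustration, a toy version of "store every row and add it to the partitions index" (the names and structures are made up; a real implementation would issue two Cassandra writes, one to the data CF and one to the index CF):

```python
from collections import defaultdict

# Stand-ins for a data CF and a partitions-index CF.
rows = {}                              # row_key -> value
partitions_index = defaultdict(list)   # partition -> [row_key, ...]

def store(partition, row_key, value):
    """Write the row, then record its key under its partition.

    Each index row only grows with the size of one partition, which is
    what keeps it under the ~10M-column danger zone mentioned above.
    """
    rows[row_key] = value
    partitions_index[partition].append(row_key)
```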


From: Andrey Ilinykh <<>>
Reply-To: "<>" <<>>
Date: Tuesday, December 11, 2012 10:45 AM
To: "<>" <<>>
Subject: Re: Selecting rows efficiently from a Cassandra CF containing time series data

I would consider using wide rows. If you add a timestamp to your column name you have naturally sorted data, and you can easily select any time range without any indexes.
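A small stand-in for how a comparator-sorted wide row supports timestamp range slices (a sorted Python list plays the role of the row; the function names are assumptions):

```python
from bisect import bisect_left, insort

# One wide row per user; column "names" are (timestamp, payload) pairs,
# kept in sorted order the way Cassandra's comparator would keep them.
columns = []

def insert(ts, payload):
    """Insert a column; the row stays sorted by timestamp."""
    insort(columns, (ts, payload))

def time_range(start, end):
    """Slice the row for timestamps in [start, end) -- no index needed."""
    lo = bisect_left(columns, (start,))
    hi = bisect_left(columns, (end,))
    return columns[lo:hi]
```

Because the row is already ordered, a time-range query is just a contiguous slice of columns, which is exactly what a Cassandra column slice gives you.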