cassandra-user mailing list archives

From aaron morton <>
Subject Re: Selecting rows efficiently from a Cassandra CF containing time series data
Date Tue, 11 Dec 2012 20:48:42 GMT
Couple of ideas. One is to multiplex the event log stream (using Flume or Kafka) and feed it
straight into your secondary system. The event system should allow you to rate-limit inserts
if that is a concern.

The other is to use partitioning.

Group the log entries per user into some sensible partition, e.g. per day or per week, so
that your row key is "user_id : partition_start".
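The "user_id : partition_start" row key above can be sketched in plain Python. The 7-day window, the date format, and the function names here are illustrative assumptions, not anything from the thread:

```python
from datetime import datetime, timedelta, timezone

def partition_start(ts: datetime, days: int = 7) -> str:
    """Truncate a timestamp to the start of its partition window.

    Buckets by whole days since the Unix epoch (7-day windows by
    default); the bucket's start date becomes part of the row key.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    bucket = (ts - epoch).days // days * days
    return (epoch + timedelta(days=bucket)).strftime("%Y-%m-%d")

def row_key(user_id: str, ts: datetime) -> str:
    # Row key of the form "user_id : partition_start", as in the post.
    return f"{user_id}:{partition_start(ts)}"
```

Two log entries that fall in the same window then map to the same row, so all of a user's recent activity is a handful of contiguous rows rather than one unbounded one.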

You can then keep a record of dirty partitions; this can be tricky depending on scale. One
approach is a row for each user, with a column for each dirty partition. Loading the delta then
requires a range scan over the dirty-partitions CF to read all rows, followed by reading each
dirty partition for the user. You would want to look at a low GC grace and LDB for the
dirty-partitions CF.
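The dirty-partitions bookkeeping above can be sketched in memory; the class and method names are hypothetical, and a real implementation would back this with a Cassandra CF (with the low GC grace and leveled compaction Aaron mentions) rather than a dict:

```python
from collections import defaultdict

class DirtyPartitionIndex:
    """Sketch of the 'dirty partitions' CF described above:
    one row per user, one column per dirty partition."""

    def __init__(self):
        self._dirty = defaultdict(set)  # user_id -> set of partition starts

    def mark_dirty(self, user_id, partition_start):
        # Called alongside every log-entry insert.
        self._dirty[user_id].add(partition_start)

    def load_delta(self, read_partition):
        """Scan all users, read each dirty partition via the
        caller-supplied read_partition(user_id, partition) callback,
        then clear the markers (a delete in the real CF)."""
        delta = []
        for user_id, partitions in list(self._dirty.items()):
            for p in sorted(partitions):
                delta.extend(read_partition(user_id, p))
            del self._dirty[user_id]
        return delta
```

The key point is that the secondary loader only touches partitions that actually changed, instead of rescanning every user's full history.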

Hope that helps.  

Aaron Morton
Freelance Cassandra Developer
New Zealand


On 12/12/2012, at 7:20 AM, "Hiller, Dean" <> wrote:

> Wide rows do not work well once you get past 10,000,000 columns, so be very, very careful
> there. PlayOrm does some wide-row indices for us, and each index row's length is as large as
> the number of rows in the partition; without PlayOrm you could do the partitioning yourself,
> by the way. It's as simple as storing every row and adding it to the partition's index.
> Later,
> Dean
> From: Andrey Ilinykh <<>>
> Reply-To: "<>" <<>>
> Date: Tuesday, December 11, 2012 10:45 AM
> To: "<>" <<>>
> Subject: Re: Selecting rows efficiently from a Cassandra CF containing time series data
> I would consider using wide rows. If you add a timestamp to your column name, you get naturally
> sorted data. You can easily select any time range without any indexes.
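Andrey's wide-row suggestion can be sketched as follows; the class is a stand-in for a Cassandra row whose column names are timestamps, and all names here are illustrative:

```python
import bisect

class WideRow:
    """Sketch of a wide row: column name = timestamp, kept sorted,
    so any time range is a contiguous slice (no index required)."""

    def __init__(self):
        self._ts = []    # sorted column names (timestamps)
        self._vals = []  # column values, parallel to _ts

    def insert(self, ts, value):
        # Cassandra keeps columns sorted by name; we emulate that here.
        i = bisect.bisect_right(self._ts, ts)
        self._ts.insert(i, ts)
        self._vals.insert(i, value)

    def slice(self, start, end):
        # Inclusive time-range read, like a column slice query.
        lo = bisect.bisect_left(self._ts, start)
        hi = bisect.bisect_right(self._ts, end)
        return list(zip(self._ts[lo:hi], self._vals[lo:hi]))
```

Because the columns are physically ordered by timestamp, a time-range query is a single contiguous read, which is what makes this pattern cheap compared to a secondary index.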
