cassandra-user mailing list archives

From Jens Rantil <>
Subject Re: best practices for time-series data with massive amounts of records
Date Tue, 03 Mar 2015 12:32:24 GMT

I have not done something similar myself; however, I have some comments:

On Mon, Mar 2, 2015 at 8:47 PM, Clint Kelly <> wrote:

> The downside of this approach is that we can no longer do a simple
> continuous scan to get all of the events for a given user.

Sure, but would you really do that in real time anyway? :) If you have
billions of events, that's not going to scale regardless. Also, with say
100,000 events per bucket, the latency introduced by batching across
buckets should be acceptable.
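To make the bucketing concrete, a schema along these lines is what I have in mind — just a sketch; the table and column names, and the day-sized `time_bucket`, are illustrative rather than anything from your actual schema:

```sql
-- Sketch of a time-bucketed event table. Partitioning on
-- (user_id, time_bucket) keeps any single partition bounded,
-- at the cost of querying one bucket at a time.
CREATE TABLE events_by_user_bucket (
    user_id     uuid,
    time_bucket text,      -- e.g. '2015-03-03' for day-sized buckets
    event_time  timeuuid,
    payload     blob,
    PRIMARY KEY ((user_id, time_bucket), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
```

Reading "all events for a user" then becomes a series of per-bucket queries, which you can issue sequentially or in parallel from the client.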

> Some users may log lots and lots of interactions every day, while others
> may interact with our application infrequently,

This is another reason to split them up into buckets: it keeps the cluster's
partitions more manageable and homogeneous.

> so I'd like a quick way to get the most recent interaction for a given
> user.

For this you could actually have a second table that stores the
last_time_bucket for a user. Upon event write, you could simply do an
update of the last_time_bucket. You could even have an index of all time
buckets per user if you want.
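For example, something like the following — again just a sketch with illustrative names, not a schema from this thread:

```sql
-- Sketch of a per-user "latest bucket" lookup table.
CREATE TABLE user_last_bucket (
    user_id          uuid PRIMARY KEY,
    last_time_bucket text
);

-- On every event write, also upsert the user's latest bucket
-- (UPDATE in CQL is an upsert, so no prior row is needed):
UPDATE user_last_bucket
   SET last_time_bucket = '2015-03-03'
 WHERE user_id = 123e4567-e89b-12d3-a456-426655440000;
```

To fetch a user's most recent interaction, you would first read `last_time_bucket` for the user, then query that one partition of the event table ordered by time descending with `LIMIT 1`.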

> Has anyone used different approaches for this problem?
> The only thing I can think of is to use the second table schema described
> above, but switch to an order-preserving hashing function, and then
> manually hash the "id" field.  This is essentially what we would do in
> HBase.

As you might already know, order-preserving hashing is _not_
considered best practice in the Cassandra world, since it tends to
create hotspots by clustering writes onto a few nodes.


Jens Rantil
Backend engineer
Tink AB

Phone: +46 708 84 18 32

Facebook · LinkedIn · Twitter
