cassandra-user mailing list archives

From Clint Kelly <clint.ke...@gmail.com>
Subject best practices for time-series data with massive amounts of records
Date Mon, 02 Mar 2015 19:47:25 GMT
Hi all,

I am designing an application that will capture time series data where we
expect the number of records per user to potentially be extremely high.  I
am not sure if we will eclipse the limit of 2 billion cells per partition,
but I assume that we would not want our application to approach that size
anyway.

If we wanted to put all of the interactions in a single row, then I would
make a data model that looks like:

CREATE TABLE events (
  id text,
  event_time timestamp,
  event blob,
  PRIMARY KEY (id, event_time))
WITH CLUSTERING ORDER BY (event_time DESC);
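With this model, getting a user's most recent events is a single query
against one partition (the user id 'alice' below is just a placeholder):

SELECT event_time, event FROM events
WHERE id = 'alice'
LIMIT 10;

Because of the DESC clustering order, the LIMIT returns the newest events
first without any server-side sorting.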

The best practice for breaking up large rows of time series data is, as I
understand it, to put part of the time into the partitioning key (
http://planetcassandra.org/getting-started-with-time-series-data-modeling/):

CREATE TABLE events (
  id text,
  date text, // could also use year+month, year+week, or similar
  event_time timestamp,
  event blob,
  PRIMARY KEY ((id, date), event_time))
WITH CLUSTERING ORDER BY (event_time DESC);
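With the date in the partition key, every query must now name a specific
bucket as well as the user.  For example (assuming the date column holds a
'YYYY-MM-DD' string; 'alice' is again a placeholder):

SELECT event_time, event FROM events
WHERE id = 'alice' AND date = '2015-03-02'
LIMIT 10;

Spanning several days means issuing one such query per bucket.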

The downside of this approach is that we can no longer do a single
continuous scan to get all of the events for a given user.  Some users may
log many interactions every day, while others may interact with our
application only rarely, so no one bucket size fits everyone.  I'd like a
quick way to get the most recent interaction for a given user without
probing backward through potentially many empty date buckets.
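(One common workaround, sketched here only as an assumption about what
might work for us: keep a small companion table holding just the latest
event per user, and upsert into it on every write.  The table name and
columns below are hypothetical.)

CREATE TABLE latest_event (
  id text PRIMARY KEY,
  event_time timestamp,
  event blob);

-- On every event write, also do:
-- INSERT INTO latest_event (id, event_time, event) VALUES (?, ?, ?);

Reading the most recent interaction is then a single-partition lookup by
id, at the cost of one extra write per event.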

Has anyone used different approaches for this problem?

The only thing I can think of is to use the second table schema described
above, but switch to an order-preserving hashing function, and then
manually hash the "id" field.  This is essentially what we would do in
HBase.

Curious if anyone else has any thoughts.

Best regards,
Clint
