incubator-cassandra-user mailing list archives

From Tyler Hobbs <ty...@datastax.com>
Subject Re: what is the best data model for time series of small data chunks...
Date Tue, 10 Jul 2012 18:23:04 GMT
On Tue, Jul 10, 2012 at 12:14 PM, Roland Hänel <roland@haenel.me> wrote:

> Hi,
>
> I have an application that consists of multiple (possibly thousands of)
> measurement series, and each measurement series generates a small amount of
> data output (only about 500 bytes) every 10 seconds. This time series of
> data should be stored in Cassandra in such a way that read access is
> possible for a given time range.
>
> What I do today is
>    - assign a timeuuid to each data output
>    - write to two CFs:
>          - first CF has key = measurement series ID, column name =
>            timeuuid_of_output
>          - second CF has key = timeuuid_of_output, column value = data
>            output (~ 500 bytes)
>
> When someone requests a time range of data, I read the first CF, get a
> series of timeuuid's, and then do a row-multiget on the second CF.
>
> This works great, but tends to be slow for big series of data (let's say
> for 10 days, nearly 100,000 records will be requested from the second CF).
> This load of 100,000 reads will be distributed through the cluster (because
> the second CF scales very nicely with a RandomPartitioner), but more or
> less one ends up with 100,000 individual read requests, at least that's
> what I suspect.
>
> Can anyone say if there is a better data model for this type of query?
> Would it be a reasonable improvement to put all data into a single CF with
>
>    - single CF, key = measurement series ID, column name =
> timeuuid_of_output, column value = data output
>
> When I request a series of 100,000 columns from this row (now it's a
> single row), can the performance really be better? Is there any chance that
> Cassandra will be able to read this data "en bloc" from the hard drive?
>

This is definitely the approach I would take.  Reading a single row is
nearly sequential, so you'll get very good performance.
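
To make the single-row layout concrete, here is a rough in-memory sketch
in Python. It is not Cassandra client code and not how Cassandra is
implemented; it only models the idea that the columns of one row are kept
sorted by name, so a time-range read is one contiguous slice. The names
WideRow, write and read_range are made up for illustration, and plain
integer timestamps (seconds) stand in for the timeuuid column names.

```python
import bisect


class WideRow:
    """Toy stand-in for one row whose columns are kept sorted by name."""

    def __init__(self):
        self._names = []   # sorted column names (integer timestamps here)
        self._values = []  # column values, parallel to _names

    def insert(self, ts, value):
        # Keep columns ordered by timestamp as they are written.
        i = bisect.bisect_left(self._names, ts)
        self._names.insert(i, ts)
        self._values.insert(i, value)

    def slice(self, start, end):
        # All columns with start <= ts <= end, as one contiguous range.
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return list(zip(self._names[lo:hi], self._values[lo:hi]))


# One row per measurement series: key = series ID (hypothetical store).
store = {}


def write(series_id, ts, data):
    store.setdefault(series_id, WideRow()).insert(ts, data)


def read_range(series_id, start, end):
    row = store.get(series_id)
    return row.slice(start, end) if row else []


# One hour of ~500-byte outputs, one every 10 seconds.
payload = b"x" * 500
for ts in range(0, 3600, 10):
    write("series-42", ts, payload)

chunk = read_range("series-42", 600, 1190)
print(len(chunk))  # 60 columns, fetched as a single slice of one row
```

For scale: ten days at one output every 10 seconds is 86,400 columns,
or roughly 43 MB at 500 bytes each, read as one range of a single row
rather than as 86,400 separate row lookups.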

I recommend you check these out:

- http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
- http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra

-- 
Tyler Hobbs
DataStax <http://datastax.com/>
