cassandra-user mailing list archives

From Daniel Lundin <...@eintr.org>
Subject Re: Use Case scenario: Keeping a window of data + online analytics
Date Mon, 08 Mar 2010 13:44:23 GMT
A few comments on building a time-series store in Cassandra...

Using the timestamp dimension of columns, i.e. "reusing" columns, could
prove quite useful. It lets you use batch_mutate deletes (new in 0.6)
to purge old data that falls outside the active time window.
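For instance, one common scheme is to bucket events into one row per
time slice, so purging reduces to deleting whole rows by key. A rough
Python sketch of the key layout (the "events" prefix, hourly buckets,
and function names are purely illustrative, not a prescribed design):

```python
from datetime import datetime, timedelta

def bucket_key(ts, prefix="events"):
    # One row per hour; individual events live as columns in that row.
    return "%s:%s" % (prefix, ts.strftime("%Y%m%d%H"))

def expired_buckets(now, window_hours, horizon_hours=24, prefix="events"):
    # Row keys older than the active window. Feed these to batch_mutate
    # as row deletions; horizon_hours bounds how far back we bother
    # looking (older rows were purged on earlier passes).
    keys = []
    for age in range(window_hours, window_hours + horizon_hours):
        keys.append(bucket_key(now - timedelta(hours=age), prefix))
    return keys
```

The actual delete is then a single batch_mutate call with a row-level
Deletion per key, rather than touching individual columns.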

Otherwise, performance-wise, deletes and "updates" are the same in
Cassandra (see
http://spyced.blogspot.com/2010/02/distributed-deletes-in-cassandra.html).

Data should be spread out over the ring, so load distribution is
constant regardless of time or "burst peaks".

A separate location cache, using a counting/timestamped bloom filter,
might be useful too, depending on your app, data structures, and
throughput requirements. This should be kept outside Cassandra and in
RAM (redis or even memcache would fit nicely, but a simple RPC service
would be faster). Something like this would let you build a tuned
sliding-window cache that ensures reads are minimized.
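To make the sliding-window idea concrete, here is a minimal sketch of a
time-sliced bloom filter in Python: two alternating filters cover
consecutive time slices, so old entries age out wholesale when the
window rotates instead of requiring per-key deletes. All sizes, hash
counts, and names are illustrative, not tuned values:

```python
import hashlib
import time

class SlidingBloom:
    """Membership cache from two alternating Bloom filters over
    consecutive time slices. A key is "present" if it is in either
    slice, so the effective window is one to two slices long."""

    def __init__(self, bits=1 << 16, hashes=4, slice_seconds=300):
        self.bits = bits
        self.hashes = hashes
        self.slice_seconds = slice_seconds
        self.filters = [0, 0]  # [current slice, previous slice] as big ints
        self.epoch = 0

    def _positions(self, key):
        # Derive k bit positions from slices of a single SHA-1 digest.
        digest = hashlib.sha1(key.encode()).digest()
        for i in range(self.hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits

    def _rotate(self, now):
        epoch = int(now // self.slice_seconds)
        if epoch - self.epoch >= 2:
            self.filters = [0, 0]                # window fully expired
        elif epoch - self.epoch == 1:
            self.filters = [0, self.filters[0]]  # drop the oldest slice
        self.epoch = max(self.epoch, epoch)

    def add(self, key, now=None):
        self._rotate(time.time() if now is None else now)
        for p in self._positions(key):
            self.filters[0] |= 1 << p

    def __contains__(self, key):
        return any(all((f >> p) & 1 for p in self._positions(key))
                   for f in self.filters)
```

A "counting" variant would track per-bucket counts instead of single
bits; the rotation trick above is the simpler way to get expiry.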

Rinse, refactor, repeat, until fast enough and/or job is done ...

>     - Can we keep this "data window" approach, or will a high rate of
> delete pose a problem?

Delete and "insert" are both mutations, so if you can do one, you can do
the other in ~ the same time. IOW, your rate of mutations in a
one-in-one-out scenario is simply 2 * insert-rate.

Due to the nature of deletes, you do need to plan for storing "deleted"
data until compaction, though. The compaction phase itself will probably
need to be accounted for as well, but that too is predictable.
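A back-of-envelope way to size this: tombstoned rows stick around for
GCGraceSeconds before compaction can drop them, so steady-state storage
covers both the live window and that grace period. A sketch (the rates
and intervals are made-up example numbers):

```python
def steady_state_rows(insert_rate, window_seconds, gc_grace_seconds):
    # In a one-in-one-out scheme, the store holds the live window
    # plus rows tombstoned within the last GCGraceSeconds, since
    # both occupy disk until compaction removes the latter.
    live = insert_rate * window_seconds
    tombstoned = insert_rate * gc_grace_seconds
    return live + tombstoned
```

E.g. 100 inserts/s with a 1-hour window and a 10-minute grace period
means planning for roughly 420,000 rows, not 360,000.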

>     - We need read speed, I understand writes won't be a problem, but
> there will be a lot of reads, some of them with large sets of values.
>     - What role plays RAM in Cassandra under this scenario?

0.6 has improved caching for reads, but if your app truly needs high
performance reads, some kind of application-tuned cache frontend (as
mentioned above) is not a bad thing. For sliding-window time series,
it's hard to beat a simple bloom-filter based cache without reaching for
complexity.

>     Of course we are looking at Cassandra as a possible solution
> and/or part of the solution, against / or combined with a in memory
> DB.

It's certainly possible to decouple purging from insertion in Cassandra,
but there's no generic "this is how you do it" answer.

This, IMHO, is a good thing though.

/d
