Starfish loves you.

On Wed, May 5, 2010 at 1:16 PM, David Strauss wrote:
On 2010-05-05 04:50, Denis Haskin wrote:
> I've been reading everything I can get my hands on about Cassandra and
> it sounds like a possibly very good framework for our data needs; I'm
> about to take the plunge and do some prototyping, but I thought I'd
> see if I can get a reality check here on whether it makes sense.
> Our schema should be fairly simple; we may only keep our original data
> in Cassandra, and the rollups and analyzed results in a relational db
> (although this is still open for discussion).

This is what we do on some projects. This is a particularly nice
strategy if the raw : aggregated ratio is really high or the raw data is
bursty or highly volatile.

Consider Hadoop integration for your aggregation needs.

> We have fairly small records: 120-150 bytes, in maybe 18 columns.
> Data is additive only; we would rarely, if ever, be deleting data.

Cassandra loves you.

> Our core data set will accumulate at somewhere between 14 and 27
> million rows per day; we'll be starting with about a year and a half
> of data (7.5 - 15 billion rows) and eventually would like to keep 5
> years online (25 to 50 billion rows).  (So that's maybe 1.3TB or so
> per year, data only.  Not sure about the overhead yet.)
> Ideally we'd like to also have a cluster with our complete data set,
> which is maybe 38 billion rows per year (we could live with less than
> 5 years of that).
> I haven't really thought through what the schema's going to be; our
> primary key is an entity's ID plus a timestamp.  But there's 2 or 3
> other retrieval paths we'll need to support as well.
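
A quick back-of-envelope check of those figures, as a sketch only. It counts
raw payload bytes and assumes a 365-day year and decimal terabytes; it makes
no attempt to estimate storage overhead or replication.

```python
# Back-of-envelope check of the quoted row counts and payload size.
# Payload only: no per-column overhead, indexes, or replication.
ROWS_PER_DAY = (14_000_000, 27_000_000)   # low and high estimates from the thread
BYTES_PER_ROW = (120, 150)                # low and high record sizes from the thread
DAYS_PER_YEAR = 365                       # assumption: ignore leap years


def rows(years, rows_per_day):
    """Total rows accumulated over the given number of years."""
    return rows_per_day * DAYS_PER_YEAR * years


def terabytes_per_year(rows_per_day, bytes_per_row):
    """Raw payload per year in decimal terabytes (1 TB = 1e12 bytes)."""
    return rows_per_day * DAYS_PER_YEAR * bytes_per_row / 1e12


if __name__ == "__main__":
    for label, years in (("1.5 years", 1.5), ("5 years", 5)):
        low, high = (rows(years, r) / 1e9 for r in ROWS_PER_DAY)
        print(f"{label}: {low:.1f} to {high:.1f} billion rows")
    low_tb = terabytes_per_year(ROWS_PER_DAY[0], BYTES_PER_ROW[0])
    high_tb = terabytes_per_year(ROWS_PER_DAY[1], BYTES_PER_ROW[1])
    print(f"payload: {low_tb:.2f} to {high_tb:.2f} TB per year")
```

The row counts line up with the quoted ranges, and the payload works out to
roughly 0.6 to 1.5 TB per year, so 1.3TB sits toward the high end.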

Generally, you handle multiple retrieval paths through denormalization in
Cassandra.
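
A minimal sketch of the idea, using plain Python dicts as stand-ins for
column families rather than any real Cassandra client. The second retrieval
path (by region and day), the field names, and the helper functions are all
hypothetical; the point is only that each record is written once per path
you need to read it by.

```python
from collections import defaultdict

# Each dict stands in for one column family: row key -> {column name -> value}.
by_entity = defaultdict(dict)       # primary path: entity ID, columns keyed by timestamp
by_region_day = defaultdict(dict)   # hypothetical second path: (region, day)

SECONDS_PER_DAY = 86400


def record(entity_id, ts, region, payload):
    """Write the same reading once per retrieval path (denormalized)."""
    by_entity[entity_id][ts] = payload
    day = ts // SECONDS_PER_DAY
    by_region_day[(region, day)][(ts, entity_id)] = payload


def readings_for_entity(entity_id, start, end):
    """Time-range read on the primary path: start <= ts < end."""
    cols = by_entity.get(entity_id, {})
    return [(ts, cols[ts]) for ts in sorted(cols) if start <= ts < end]


def readings_for_region_day(region, day):
    """Read on the second path without touching the primary one."""
    cols = by_region_day.get((region, day), {})
    return [(ts, eid, cols[(ts, eid)]) for ts, eid in sorted(cols)]
```

Since the data is additive only, the usual cost of denormalization (keeping
the copies consistent on update or delete) mostly does not apply; the price
is extra writes and extra disk per retrieval path.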

> Thoughts?  Pitfalls?  Gotchas? Are we completely whacked?

Does the random partitioner support what you need?
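
To make the question concrete: the random partitioner places rows by a hash
of the row key, so rows do not come back in key order and a range scan over
row keys is not meaningful. Columns within a row are still sorted, so using
the entity ID as the row key and the timestamp as the column name keeps
per-entity time ranges cheap. The sketch below only approximates the
placement (an MD5-derived token and a modulo instead of real ring token
ranges); the function names and key format are made up for illustration.

```python
import hashlib


def token(row_key: str) -> int:
    """MD5-derived token, approximating how a hash partitioner places a row."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")


def owner(row_key: str, node_count: int) -> int:
    """Simplified placement: real rings assign token ranges, not a modulo."""
    return token(row_key) % node_count


if __name__ == "__main__":
    # Hypothetical row keys that sort nicely as strings.
    keys = [f"entity42:{day:03d}" for day in range(10)]
    for key in sorted(keys, key=token):
        print(key, owner(key, node_count=4))
    # The printed order follows the hash, not the key, which is why
    # key-ordered range scans need an order-preserving partitioner.
```

If one of the other retrieval paths needs ordered scans across row keys,
that is the case to prototype first.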

David Strauss
Four Kitchens
  | +1 512 454 6659 [office]
  | +1 512 870 8453 [direct]