From Steven Mac <>
Subject Advice wanted on modeling
Date Tue, 11 Jan 2011 18:07:05 GMT


I've been experimenting quite a bit with Cassandra and think I'm getting to understand it,
but I would like some advice on modeling my data in Cassandra for an application I'm developing.

The application will have a large number of records, with the records consisting of a fixed
part and a number (n) of periodic parts.
* The fixed part is updated occasionally.
* The periodic parts are never updated, but a new one is added every 5 to 10 minutes. Only
the last n periodic parts need to be kept, so that the oldest one can be deleted after adding
a new part.
* The records will always be read completely (meaning fixed part and all periodic parts).
Reads are less frequent than writes.
The application will be running continuously, at least for a few weeks, so there will be many,
many stale periodic parts, and I'm a bit worried about disk space consumption and compactions.
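
To put a rough number on "many, many", here is a back-of-the-envelope sketch in Python. The
21-day run time and n = 10 in the usage note are assumptions chosen for illustration only
(I said "a few weeks" and left n open), and the function names are made up for this example.

```python
def periodic_writes(days, interval_minutes):
    """Periodic parts written per record over a run of `days` days."""
    return days * 24 * 60 // interval_minutes


def stale_parts(days, interval_minutes, n):
    """Periodic parts per record that are obsolete once only the last n are kept."""
    return max(0, periodic_writes(days, interval_minutes) - n)
```

With an assumed 21-day run, periodic_writes(21, 5) gives 6048 writes per record and
periodic_writes(21, 10) gives 3024, so with n = 10 nearly all of them end up stale.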

With respect to modeling the above in Cassandra, I would welcome insights into the following
alternatives:

1) For every period, add a new column to each record and delete the oldest column with a batch_mutate.
This obviously causes many tombstones.
2) For every period, overwrite the oldest column for each record with the new one (cyclic/modulo
behaviour). AFAIK this does not cause any tombstones, but will probably cause the SSTables
to get polluted.
3) (0.7 only) For every period, create a new CF and add columns to it with a batch_mutate
and drop the oldest CF. The obsolete data can be cleaned up immediately, but I'm not sure
if this is proper/recommended use of dynamic CFs.
4) Don't use Cassandra at all and investigate other storage solutions. Suggestions would be
welcome if you favour this approach.
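
To make the cyclic/modulo idea in option 2 concrete, here is a minimal sketch in plain Python.
A dict stands in for the periodic columns of one row; no Cassandra client is involved, and the
names slot_name, write_period and read_periodic are hypothetical helpers for this example, not
part of any API. It assumes each period has an increasing integer index.

```python
def slot_name(period_index, n):
    """Column name for a period: indices wrap around after n slots."""
    return "p%03d" % (period_index % n)


def write_period(row, period_index, n, value):
    """Overwrite the oldest slot in place; nothing is deleted explicitly."""
    # Store the period index with the value so a reader can restore the order.
    row[slot_name(period_index, n)] = (period_index, value)


def read_periodic(row):
    """Return the stored values ordered from oldest to newest period."""
    return [value for _, value in sorted(row.values())]
```

For example, writing periods 0 through 4 with n = 3 leaves exactly three columns in the row,
holding the values for periods 2, 3 and 4. The period index is kept next to the value because
the column names alone no longer say which slot is the most recent.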

Also I'm wondering whether I should be putting the fixed and periodic parts together in one
Super CF, or whether it would be better to separate the fixed part into one CF and the periodic
parts in another. Since I'll be reading all data of a record at the same time, my preference
would go to a Super CF, but I'm open to anyone wanting to talk me out of this ;-)

Thanks, Steven.
