That sounds a little complicated. 

Do you want to get the data out for an off-node backup, or is it for processing in another system?

You may get by using:

* TTLs to expire data, which is then removed during compaction
* snapshots for backups
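
For the snapshot suggestion, a minimal sketch of how a backup script might drive nodetool. The host name "node2" and keyspace name "MyKeyspace" are placeholders, not values from this thread, and the script assumes nodetool is on the PATH. It defaults to a dry run so the command can be inspected first.

```python
import subprocess


def snapshot_command(host, keyspace, nodetool="nodetool"):
    """Build the argument list for `nodetool -h <host> snapshot <keyspace>`."""
    return [nodetool, "-h", host, "snapshot", keyspace]


def take_snapshot(host, keyspace, dry_run=True):
    """Return the snapshot command; run it only when dry_run is False."""
    cmd = snapshot_command(host, keyspace)
    if not dry_run:
        # Raises CalledProcessError if nodetool exits with a non-zero status.
        subprocess.check_call(cmd)
    return cmd


if __name__ == "__main__":
    # Placeholder host and keyspace names; replace with real ones.
    print(" ".join(take_snapshot("node2", "MyKeyspace", dry_run=True)))
```

A snapshot is a set of hard links to the existing SSTables, so taking one is cheap. Getting the data off the node still means copying the snapshot directory somewhere else afterwards.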


Aaron Morton
Freelance Developer

On 3/01/2012, at 11:00 AM, Alexandru Sicoe wrote:

Hi everyone and Happy New Year!

I need advice on organizing the data flow out of my 3-node Cassandra 0.8.6 cluster. I am configuring my keyspace to use the NetworkTopologyStrategy. I have 2 data centers, each with a replication factor of 1 (i.e. DC1:1; DC2:1). The configuration of the PropertyFileSnitch is:
I assign tokens like this:
                        node1 = 0
                        node2 = 1
                        node3 = 85070591730234615865843651857942052864
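
The tokens above match the usual scheme for the RandomPartitioner under NetworkTopologyStrategy: space the tokens evenly within each data center, then offset each additional data center by 1 so that no two nodes share a token. A minimal sketch of that arithmetic follows. It assumes node1 and node3 are in DC1 and node2 is alone in DC2, which the description implies but the snitch configuration would confirm. The function name dc_tokens is made up for illustration.

```python
RING_SIZE = 2 ** 127  # size of the RandomPartitioner token space


def dc_tokens(node_count, offset=0):
    """Evenly spaced tokens for one data center, shifted by `offset`."""
    return [(i * RING_SIZE // node_count + offset) % RING_SIZE
            for i in range(node_count)]


# Assumed layout: DC1 holds node1 and node3, DC2 holds node2.
dc1 = dc_tokens(2, offset=0)  # tokens for node1 and node3
dc2 = dc_tokens(1, offset=1)  # token for node2

if __name__ == "__main__":
    print("DC1:", dc1)
    print("DC2:", dc2)
```

Under that assumed layout, this reproduces the three tokens listed above.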

My write consistency level is ANY.

My data sources only insert data into node1 & node3. Essentially, a replica of every input value ends up on node2, so node2 holds a copy of all the data written to the cluster. When node2 starts getting full, I want a script that pulls it off-line and runs a sequence of operations (compaction/snapshotting/exporting/truncating the CFs) in order to back up the data to a remote location and free up space so that it can take more data. When it comes back on-line, it will take hints from the other 2 nodes.

This is how I plan on shipping data out of my cluster without any downtime or any major performance penalty. The problem comes when I also want to truncate the CFs on node1 & node3 to free up space there. I don't know whether I can do this without downtime or a serious performance penalty. Is anyone using truncate to clear data out of CFs? How efficient is this?

Any observations or suggestions are much appreciated!