I need to build a system that stores data for years, so yes, I am backing up data to another mass storage system, from which it can be accessed later. Data that has been backed up successfully has to be deleted from my cluster to make space for new incoming data.

I was aware of snapshotting, which I will use to get the data out of node2: it creates hard links to the SSTables of a CF, and I can then copy the files those hard links point to into another location. After that I remove the snapshot (the hard links) and truncate my CFs. It's clear that snapshotting gives me a single copy of the data when one node holds the only full copy. What is not clear to me is what happens if I have, say, a cluster with 3 nodes and RF=2, take a snapshot on every node, and copy those snapshots to remote storage. Will I get a single copy of the data in the remote storage, or twice the data (data + replica)?
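A toy model may help frame the RF=2 question. The sketch below is not Cassandra code: it assumes a single data center with simple clockwise replica placement, and the tokens, node names, and function names are made up for illustration. It counts how many row copies would end up in remote storage if every node's snapshot were shipped as-is.

```python
from bisect import bisect_left
from collections import Counter

def replicas(token, ring, rf):
    """Nodes holding `token`: the owner plus the next rf-1 nodes clockwise.
    `ring` is a sorted list of (node_token, node_name) pairs."""
    tokens = [t for t, _ in ring]
    # A node owns the range (previous token, its own token], wrapping around.
    start = bisect_left(tokens, token) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

def snapshot_row_counts(row_tokens, ring, rf):
    """Number of rows each node's snapshot would contain."""
    counts = Counter()
    for token in row_tokens:
        counts.update(replicas(token, ring, rf))
    return counts

# Assumption: toy tokens and 100 toy rows, chosen only for readability.
ring = [(0, "node1"), (100, "node2"), (200, "node3")]
rows = range(0, 300, 3)
counts = snapshot_row_counts(rows, ring, rf=2)
print(sum(counts.values()), "row copies for", len(rows), "rows")
```

Under these assumptions the snapshots together hold RF copies of every row, because each node's snapshot links all SSTables on that node, replicas included.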

I've started reading about TTL and I think I can use it, but it's not clear to me how it would work together with the snapshotting/backing up I need to do. As I understand it, TTL imposes a deadline by which I must perform a backup in order not to miss any data. I might also duplicate data if some columns have not yet expired between two consecutive backups. Any clarifications on this?
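To make the timing concern concrete, here is a rough sketch. It assumes snapshots are taken at fixed intervals starting at time 0 and that a column only counts while it is live, and it ignores that expired columns stay on disk until compaction removes them (so real snapshots may contain more). The function name and the integer time units are hypothetical.

```python
def snapshots_containing(write_time, ttl, interval):
    """Count snapshots taken at 0, interval, 2*interval, ... that fall
    inside the column's live window [write_time, write_time + ttl).
    All arguments are non-negative integers in the same time unit."""
    first = -(-write_time // interval)             # ceil(write_time / interval)
    last = -(-(write_time + ttl) // interval) - 1  # last k with k*interval < expiry
    return max(0, last - first + 1)

print(snapshots_containing(5, 10, 4))  # live during the snapshots at t=8 and t=12
print(snapshots_containing(5, 2, 4))   # expires before the next snapshot at t=8
```

A result of 0 means the column would be missed, and a result above 1 means it would appear in more than one backup. In this model, keeping the backup interval no longer than the TTL avoids misses, at the cost of some duplication.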


On Tue, Jan 3, 2012 at 9:44 AM, aaron morton <aaron@thelastpickle.com> wrote:
That sounds a little complicated. 

Do you want to get the data out for an off-node backup, or is it for processing in another system?

You may get by using:

* TTL to expire data via compaction
* snapshots for backups


Aaron Morton
Freelance Developer

On 3/01/2012, at 11:00 AM, Alexandru Sicoe wrote:

Hi everyone and Happy New Year!

I need advice on organizing data flow out of my 3-node Cassandra 0.8.6 cluster. I am configuring my keyspace to use NetworkTopologyStrategy. I have 2 data centers, each with a replication factor of 1 (i.e. DC1:1; DC2:1). The configuration of the PropertyFileSnitch is:
I assign tokens like this:
                        node1 = 0
                        node2 = 1
                        node3 = 85070591730234615865843651857942052864
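
These values match the common pattern of spacing tokens evenly within each data center over the RandomPartitioner range [0, 2**127) and offsetting the second data center by 1 to avoid collisions. A small sketch of that arithmetic follows. The grouping of node1 and node3 in DC1 and node2 in DC2 is an assumption inferred from the replica behaviour described in this message, and `dc_tokens` is a hypothetical helper.

```python
# RandomPartitioner tokens live in the range [0, 2**127).
RING_SIZE = 2 ** 127

def dc_tokens(node_count, offset=0):
    """Evenly spaced tokens for one data center, shifted by `offset`
    so they do not collide with tokens used in another data center."""
    return [i * RING_SIZE // node_count + offset for i in range(node_count)]

# Assumption: DC1 holds node1 and node3, DC2 holds node2.
dc1 = dc_tokens(2)            # node1, node3
dc2 = dc_tokens(1, offset=1)  # node2

print(dc1)
print(dc2)
```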

My write consistency level is ANY.

My data sources only insert data into node1 & node3. Essentially, a replica of every input value ends up on node2, so node2 holds a copy of all the data written to the cluster. When node2 starts getting full, I want a script that takes it off-line and runs a sequence of operations (compaction/snapshotting/exporting/truncating the CFs) to back up the data to a remote place and free up the node so that it can take more data. When it comes back on-line, it will receive hints from the other 2 nodes.
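
A minimal sketch of the exporting step in such a script, assuming the snapshot has already been taken and that its directory and the backup destination are passed in by the caller. Both paths and the function name are hypothetical, and the other steps (compaction, snapshotting, truncating) are left out.

```python
import os
import shutil

def export_snapshot(snapshot_dir, backup_dir):
    """Copy every regular file in snapshot_dir to backup_dir and check sizes.
    Returns the sorted list of file names that were copied."""
    os.makedirs(backup_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(snapshot_dir)):
        src = os.path.join(snapshot_dir, name)
        if not os.path.isfile(src):
            continue  # skip subdirectories
        dst = os.path.join(backup_dir, name)
        shutil.copy2(src, dst)  # copies the file contents, not the hard link
        if os.path.getsize(src) != os.path.getsize(dst):
            raise IOError("size mismatch after copying " + name)
        copied.append(name)
    return copied
```

The idea is to clear the snapshot and free the CFs only after this function returns without raising, so a failed copy never leads to deleted data.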

This is how I plan to ship data out of my cluster without any downtime or major performance penalty. The problem comes when I also want to truncate the CFs on node1 & node3 to free them up as well. I don't know whether I can do this without downtime or a serious performance penalty. Is anyone using truncate to free up CFs? How efficient is it?

Any observations or suggestions are much appreciated!