cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: emptying my cluster
Date Tue, 03 Jan 2012 19:19:48 GMT
Running a time-based rolling window of data can be done using TTLs. Backing up the nodes
for disaster recovery can be done using snapshots. Restoring to an arbitrary point in time will be
tricky, because you may restore columns whose TTL has already expired.
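
For what it's worth, a TTL write looks roughly like the sketch below (this uses the pycassa
client; the keyspace, CF and row key names are just placeholders):

    import time
    import pycassa

    # Placeholder keyspace / column family names; adjust to your schema.
    pool = pycassa.ConnectionPool('MyKeyspace', server_list=['node1:9160'])
    events = pycassa.ColumnFamily(pool, 'Events')

    WINDOW_SECONDS = 30 * 24 * 3600  # e.g. a 30 day rolling window

    # Columns written with a TTL are dropped by compaction after they expire,
    # which gives you the rolling window without issuing deletes.
    events.insert('sensor-42', {str(int(time.time())): 'reading'}, ttl=WINDOW_SECONDS)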

> Will I get a single copy of the data in the remote storage or will it be twice the data
> (data + replica)?
You will get RF copies of the data. (By the way, there is no original copy; all replicas are equal.)

Can you share a bit more about the use case? How much data, and what sort of read patterns?

Can you split the data stream into a permanent log record, and also into Cassandra for a rolling
window of queryable data?
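
As a rough sketch of that split (the file path, keyspace and CF names are made up), the ingest
path could do both writes:

    import json
    import time
    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', server_list=['node1:9160'])
    events = pycassa.ColumnFamily(pool, 'Events')
    WINDOW_TTL = 30 * 24 * 3600

    def store(key, value):
        # 1. Append to a permanent, append-only log that gets archived to mass storage.
        with open('/var/log/myapp/permanent.log', 'a') as log:
            log.write(json.dumps({'key': key, 'ts': time.time(), 'value': value}) + '\n')
        # 2. Write to Cassandra with a TTL so the queryable window rolls over by itself.
        events.insert(key, {str(int(time.time())): value}, ttl=WINDOW_TTL)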

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/01/2012, at 11:41 PM, Alexandru Sicoe wrote:

> Hi,
> 
> I need to build a system that stores data for years, so yes, I am backing up data in
> another mass storage system from which it can later be accessed. The data that I successfully
> back up has to be deleted from my cluster to make space for new data coming in.
> 
> I was aware of snapshotting, which I will use for getting the data out of node2:
> it creates hard links to the SSTables of a CF, and then I can copy the files those hard links
> point to into another location. After that I get rid of the snapshot (the hard links)
> and then I can truncate my CFs. It's clear that snapshotting will give me a single copy of
> the data when there is only one copy of the data, on one node. What is not clear to me is what
> happens if I have, let's say, a cluster with 3 nodes and RF=2, and I take a snapshot of every
> node and copy those snapshots to remote storage. Will I get a single copy of the data in the
> remote storage or will it be twice the data (data + replica)?
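> 
> For reference, the per-backup sequence described above, as a rough sketch (the paths, keyspace
> name and snapshot tag are placeholders, and the exact nodetool arguments and snapshot
> directory layout depend on the Cassandra version):
> 
>     import shutil
>     import subprocess
> 
>     TAG = 'backup-2012-01-03'
>     DATA_DIR = '/var/lib/cassandra/data/MyKeyspace'   # placeholder data directory
>     REMOTE = '/mnt/mass-storage/cassandra-backups'    # placeholder archive location
> 
>     # 1. Snapshot: hard-links the current SSTables under a snapshots directory.
>     subprocess.check_call(['nodetool', '-h', 'localhost', 'snapshot', TAG])
>     # 2. Copy the snapshot files over to the remote storage.
>     shutil.copytree('%s/snapshots/%s' % (DATA_DIR, TAG), '%s/%s' % (REMOTE, TAG))
>     # 3. Drop the snapshot hard links once the copy has been verified.
>     subprocess.check_call(['nodetool', '-h', 'localhost', 'clearsnapshot'])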
> 
> I've started reading about TTL and I think I can use it, but it's not clear to me how
> it would work in conjunction with the snapshotting/backups I need to do. I mean, it will
> impose a deadline by which I need to perform a backup in order not to miss any data. Also,
> I might duplicate data if some columns don't fully expire between 2 backups. Any clarifications
> on this?
> 
> Cheers,
> Alex
> 
> On Tue, Jan 3, 2012 at 9:44 AM, aaron morton <aaron@thelastpickle.com> wrote:
> That sounds a little complicated. 
> 
> Do you want to get the data out for an off-node backup, or is it for processing in another
> system?
> 
> You may get by using:
> 
> * TTL to expire data via compaction
> * snapshots for backups
> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 3/01/2012, at 11:00 AM, Alexandru Sicoe wrote:
> 
>> Hi everyone and Happy New Year!
>> 
>> I need advice on organizing data flow out of my 3-node Cassandra 0.8.6 cluster.
>> I am configuring my keyspace to use the NetworkTopologyStrategy. I have 2 data centers, each
>> with a replication factor of 1 (i.e. DC1:1; DC2:1). The PropertyFileSnitch configuration is:
>>     ip_node1=DC1:RAC1
>>     ip_node2=DC2:RAC1
>>     ip_node3=DC1:RAC1
>> I assign tokens like this:
>>                         node1 = 0
>>                         node2 = 1
>>                         node3 = 85070591730234615865843651857942052864
>> 
>> My write consistency level is ANY.
>> 
>> My data sources are only inserting data into node1 & node3. Essentially what happens
>> is that a replica of every input value will end up on node2. Node2 thus has a copy of the
>> entire data set written to the cluster. When node2 starts getting full, I want to have a script
>> that pulls it off-line and does a sequence of operations (compaction/snapshotting/exporting/truncating
>> the CFs) in order to back up the data in a remote place and to free the node up so that it can
>> take more data. When it comes back on-line it will take hints from the other 2 nodes.
>> 
>> This is how I plan on shipping data out of my cluster without any downtime or any
>> major performance penalty. The problem is that I also want to truncate the CFs on node1 &
>> node3 to free them of data, and I don't know whether I can do this without any downtime
>> or any serious performance penalty. Is anyone using truncate to free up CFs of data?
>> How efficient is this?
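>> 
>> For concreteness, the truncate I have in mind would go through the client API, something
>> like this pycassa sketch (keyspace and CF names are placeholders):
>> 
>>     import pycassa
>> 
>>     pool = pycassa.ConnectionPool('MyKeyspace', server_list=['node1:9160'])
>>     events = pycassa.ColumnFamily(pool, 'Events')
>> 
>>     # Truncate drops all data for this CF cluster-wide; it needs all
>>     # replica nodes to be reachable when it runs.
>>     events.truncate()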
>> 
>> Any observations or suggestions are much appreciated!
>> 
>> Cheers,
>> Alex
> 
> 

