incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jonathan Haddad <...@jonhaddad.com>
Subject Re: cassandra backup
Date Fri, 06 Dec 2013 16:05:48 GMT
I believe SSTables are written to a temporary file then moved.  If I
remember correctly, tools like tablesnap listen for the inotify event
IN_MOVED_TO.  This should handle the "try to back up sstable while in
mid-write" issue.


On Fri, Dec 6, 2013 at 5:39 AM, Michael Theroux <mtheroux2@yahoo.com> wrote:

> Hi Marcelo,
>
> Cassandra provides and eventually consistent model for backups.  You can
> do staggered backups of data, with the idea that if you restore a node, and
> then do a repair, your data will be once again consistent.  Cassandra will
> not automatically copy the data to other nodes (other than via hinted
> handoff).  You should manually run repair after restoring a node.
>
> You should take snapshots when doing a backup, as it keeps the data you
> are backing up relevant to a single point in time, otherwise compaction
> could add/delete files one you mid-backup, or worse, I imagine attempt to
> access a SSTable mid-write.  Snapshots work by using links, and don't take
> additional storage to perform.  In our process we create the snapshot,
> perform the backup, and then clear the snapshot.
>
> One thing to keep in mind in your S3 cost analysis is that, even though
> storage is cheap, reads/writes to S3 are not (especially writes).  If you
> are using LeveledCompaction, or otherwise have a ton of SSTables, some
> people have encountered increased costs moving the data to S3.
>
> Ourselves, we maintain backup EBS volumes that we regularly snaphot/rsync
> data too.  Thus far this has worked very well for us.
>
> -Mike
>
>
>   On Friday, December 6, 2013 8:14 AM, Marcelo Elias Del Valle <
> marcelo@s1mbi0se.com.br> wrote:
>  Hello everyone,
>
>     I am trying to create backups of my data on AWS. My goal is to store
> the backups on S3 or glacier, as it's cheap to store this kind of data. So,
> if I have a cluster with N nodes, I would like to copy data from all N
> nodes to S3 and be able to restore later. I know Priam does that (we were
> using it), but I am using the latest cassandra version and we plan to use
> DSE some time, I am not sure Priam fits this case.
>     I took a look at the docs:
> http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/operations/../../cassandra/operations/ops_backup_takes_snapshot_t.html
>
>     And I am trying to understand if it's really needed to take a snapshot
> to create my backup. Suppose I do a flush and copy the sstables from each
> node, 1 by one, to s3. Not all at the same time, but one by one.
>     When I try to restore my backup, data from node 1 will be older than
> data from node 2. Will this cause problems? AFAIK, if I am using a
> replication factor of 2, for instance, and Cassandra sees data from node X
> only, it will automatically copy it to other nodes, right? Is there any
> chance of cassandra nodes become corrupt somehow if I do my backups this
> way?
>
> Best regards,
> Marcelo Valle.
>
>
>


-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade

Mime
View raw message