incubator-cassandra-user mailing list archives

From "Jeremiah Jordan" <JEREMIAH.JOR...@morningstar.com>
Subject RE: best way to backup
Date Sat, 30 Apr 2011 13:15:47 GMT
The files inside the keyspace folders are the SSTables.
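
As a rough illustration of why backing these files up incrementally can be cheap: an SSTable is not modified after it has been written, so a backup pass only has to copy files it has not seen before. The sketch below assumes that property and compares by file name only. The function names, directory layout, and file names are hypothetical and not part of Cassandra, and it is probably safer to point it at a snapshot directory than at the live data directory, since compaction can remove files while the copy runs.

```python
import shutil
from pathlib import Path


def new_sstable_files(keyspace_dir, backup_dir):
    """Return the files in keyspace_dir whose names are not yet in backup_dir."""
    keyspace_dir, backup_dir = Path(keyspace_dir), Path(backup_dir)
    already = {p.name for p in backup_dir.iterdir()} if backup_dir.is_dir() else set()
    return sorted(
        p for p in keyspace_dir.iterdir()
        if p.is_file() and p.name not in already
    )


def incremental_backup(keyspace_dir, backup_dir):
    """Copy only files that are new since the last pass; return their names."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in new_sstable_files(keyspace_dir, backup_dir):
        # Assumption: a file with a known name never changes, so name equality
        # is enough and no checksum comparison is done here.
        shutil.copy2(src, backup_dir / src.name)
        copied.append(src.name)
    return copied
```

Calling `incremental_backup` repeatedly with the same backup directory copies each file once; a full compressed tar, by contrast, re-archives every file on every run.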

________________________________

From: aaron morton [mailto:aaron@thelastpickle.com] 
Sent: Friday, April 29, 2011 4:49 PM
To: user@cassandra.apache.org
Subject: Re: best way to backup


William,
Some info on the SSTables from me:
http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/

If you want to know more, check out the BigTable and original Facebook papers,
linked from the wiki:
http://wiki.apache.org/cassandra/ArchitectureOverview

Aaron

On 29 Apr 2011, at 23:43, William Oberman wrote:


	Dumb question, but referenced twice now: which files are the
SSTables, and why is backing them up incrementally a win?

	Or should I not bother to understand the internals, and instead just
roll with the "backup my keyspace(s) and system in a compressed tar"
strategy? While it may be excessive, it's guaranteed to work, and work
easily (which I like a great deal).

	will
	
	
	On Fri, Apr 29, 2011 at 4:58 AM, Daniel Doubleday
<daniel.doubleday@gmx.net> wrote:
	

		What we are about to set up is a Time Machine-like
backup. This is more like an add-on to the S3 backup.

		Our boxes have an additional, larger drive for local
backup. We create a new backup snapshot every x hours which hardlinks the
files in the previous snapshot (a bit like Cassandra's incremental_backups
thing), and then we sync that snapshot dir with the Cassandra data dir.
We can do archiving / backup to an external system from there without
impacting the main data RAID.
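
A minimal sketch of the rotation described above, not the actual setup: hard-link the previous snapshot's files into a new snapshot directory, then sync that directory with the data dir by dropping files that no longer exist there and copying the ones that are new. The function name and arguments are hypothetical, files are matched by name only (which assumes a file's content never changes once written), and the snapshot directories are assumed to live on the same filesystem as each other so that hard links work.

```python
import os
import shutil
from pathlib import Path


def rotate_snapshot(data_dir, previous_snapshot, new_snapshot):
    """Build new_snapshot from previous_snapshot plus changes in data_dir.

    Returns the names of the files that had to be copied from data_dir.
    """
    data_dir, new_snapshot = Path(data_dir), Path(new_snapshot)
    new_snapshot.mkdir(parents=True)

    # Step 1: hard-link every file of the previous snapshot (no data copied).
    if previous_snapshot is not None:
        for old in Path(previous_snapshot).iterdir():
            if old.is_file():
                os.link(old, new_snapshot / old.name)

    # Step 2: sync with the data dir. Remove links to files that are gone ...
    live = {p.name for p in data_dir.iterdir() if p.is_file()}
    for stale in [p for p in new_snapshot.iterdir() if p.name not in live]:
        stale.unlink()  # only this snapshot's link; older snapshots keep theirs

    # ... and copy files that appeared since the previous snapshot.
    copied = []
    for name in sorted(live):
        if not (new_snapshot / name).exists():
            shutil.copy2(data_dir / name, new_snapshot / name)
            copied.append(name)
    return copied
```

Each snapshot directory then looks like a full copy of the data dir at that point in time, while unchanged files share disk blocks across snapshots.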

		But the main reason to do this is to have an 'omg we
screwed up big time and deleted / corrupted data' recovery.

		On Apr 28, 2011, at 9:53 PM, William Oberman wrote:


			Even with N nodes for redundancy, I still want
to have backups.  I'm an Amazon person, so naturally I'm thinking S3.
Reading over the docs, and messing with nodetool, it looks like each new
snapshot contains the previous snapshot as a subset (and I've read how
Cassandra uses hard links to avoid excessive disk use).  When does that
pattern break down?  
			
			I'm basically debating whether I can do an "rsync"-like
backup, or whether I should do a compressed tar backup.  And I obviously
want multiple points in time.  S3 does allow file versioning, if a file
or file name is changed/reused over time (only matters in the rsync
case).  My only concerns with compressed tars are that I'll need free
space to create the archive and that I get no "delta" space savings on the
backup (the former is solved by not allowing the disk space to get so
low and/or adding more nodes to bring down the space, the latter is
solved by S3 being really cheap anyway).
			
			-- 
			Will Oberman
			Civic Science, Inc.
			3030 Penn Avenue., First Floor
			Pittsburgh, PA 15201
			(M) 412-480-7835
			(E) oberman@civicscience.com
			





	-- 
	Will Oberman
	Civic Science, Inc.
	3030 Penn Avenue., First Floor
	Pittsburgh, PA 15201
	(M) 412-480-7835
	(E) oberman@civicscience.com
	


