incubator-cassandra-user mailing list archives

From AJ ...@dude.podzone.net>
Subject Backups, Snapshots, SSTable Data Files, Compaction
Date Tue, 07 Jun 2011 05:03:46 GMT
Hi,

I am working on a backup strategy and am trying to understand what is 
going on in the data directory.

I notice that after a write to a CF followed by a flush, a new set of data 
files is created with an incremented index number in their names, such as:

Initially:
Users-e-1-Filter.db
Users-e-1-Index.db
Users-e-1-Statistics.db

Then, after a write to the Users CF, followed by a flush:
Users-e-2-Filter.db
Users-e-2-Index.db
Users-e-2-Statistics.db
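
To make the naming pattern above concrete, here is a minimal Python sketch that groups the files into sets by their index number. The field names (version, generation, component) are my own guesses at what each part of the name means, and group_by_generation is a hypothetical helper, not anything shipped with Cassandra:

```python
import re
from collections import defaultdict

# Assumed layout: <ColumnFamily>-<version>-<generation>-<Component>.db
# The meaning of each field is inferred from the listings above.
SSTABLE_NAME = re.compile(
    r"^(?P<cf>.+)-(?P<version>[a-z]+)-(?P<generation>\d+)-(?P<component>\w+)\.db$"
)


def group_by_generation(filenames):
    """Map (column family, generation) to the sorted list of component names."""
    groups = defaultdict(list)
    for name in filenames:
        match = SSTABLE_NAME.match(name)
        if match is None:
            continue  # skip anything that does not fit the assumed pattern
        key = (match.group("cf"), int(match.group("generation")))
        groups[key].append(match.group("component"))
    return {key: sorted(components) for key, components in groups.items()}


files = [
    "Users-e-1-Filter.db",
    "Users-e-1-Index.db",
    "Users-e-1-Statistics.db",
    "Users-e-2-Filter.db",
    "Users-e-2-Index.db",
    "Users-e-2-Statistics.db",
]
sets = group_by_generation(files)
```

With the six file names listed above, this yields two sets, one per index number, which is how I am counting "sets" below.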

Currently, my data dir has about 16 sets.  I thought that compaction 
(with nodetool) would clean up these files, but it doesn't.  Neither 
do cleanup or repair.

Q1: Should the files with the lower index #'s (under the data/{keyspace} 
directory) be manually deleted?  Or do ALL of the files in this 
directory need to be backed up?

Q2: Can someone elaborate on the structure of these files and whether they 
are interrelated?  I'm guessing that maybe each incremental set is like 
an incremental or differential backup of the SSTable, but I'm not sure.  
I ask because I hope that each set isn't a full copy of the data, e.g., 
if my data set size for a given CF is 1 TB, I hope I will not end up 
with 16 TB worth of data files after 16 calls to flush... I suspect 
not, but I'm just double-checking ;o)

Q3: When and how are these extra files removed or reduced?

Thanks!
