cassandra-user mailing list archives

From Maki Watanabe <>
Subject Re: Backups, Snapshots, SSTable Data Files, Compaction
Date Tue, 07 Jun 2011 08:29:45 GMT
You can find useful information in:

SSTables are immutable. Once one is written to disk, it is never updated.
When you take a snapshot, the tool makes hard links to the SSTable files.
Over time, after a number of memtable flushes, your SSTable files
will be merged by compaction and the obsolete SSTable files will be
removed. But the snapshot set will remain on your disk, as a backup.

Assume you have SSTables A B C D E F.
When you take a snapshot, you will have hard links A B C D E F under the
snapshots subdirectory.
These hard links are files in their own right, so they will not be
removed even after you run a major or minor compaction.
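
To illustrate why the hard links survive, here is a minimal sketch that
imitates the A..F example on an ordinary filesystem. It is not Cassandra
code: the function name snapshot_demo, the merged file name G, and the file
contents are all made up for the illustration, and "compaction" is reduced to
concatenating the files and deleting the originals.

```python
import os
import tempfile


def snapshot_demo():
    """Imitate a snapshot followed by a compaction, using plain files.

    Hypothetical illustration only: A..F stand in for SSTable files and
    G for the merged file that a compaction would produce.
    """
    names = ["A", "B", "C", "D", "E", "F"]
    with tempfile.TemporaryDirectory() as data_dir:
        snap_dir = os.path.join(data_dir, "snapshots")
        os.mkdir(snap_dir)

        # Write the "live" files into the data directory.
        for name in names:
            with open(os.path.join(data_dir, name), "w") as f:
                f.write("data-" + name)

        # Snapshot: create a hard link to each live file.
        for name in names:
            os.link(os.path.join(data_dir, name),
                    os.path.join(snap_dir, name))

        # Stand-in for compaction: merge everything into G ...
        merged = []
        for name in names:
            with open(os.path.join(data_dir, name)) as f:
                merged.append(f.read())
        with open(os.path.join(data_dir, "G"), "w") as f:
            f.write("".join(merged))

        # ... then remove the obsolete files from the data directory.
        for name in names:
            os.remove(os.path.join(data_dir, name))

        live = sorted(n for n in os.listdir(data_dir) if n != "snapshots")
        kept = sorted(os.listdir(snap_dir))
        with open(os.path.join(snap_dir, "A")) as f:
            content = f.read()
    return live, kept, content


if __name__ == "__main__":
    print(snapshot_demo())
```

After the originals are removed, only G is left in the data directory, while
the snapshots subdirectory still lists A to F with their contents readable.
Removing a name only drops one link to the data; the data itself stays on
disk as long as another hard link points to it.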


2011/6/7 AJ <>:
> On 6/6/2011 11:25 PM, Benjamin Coverston wrote:
>>> Currently, my data dir has about 16 sets.  I thought that compaction
>>> (with nodetool) would clean-up these files, but it doesn't.  Neither does
>>> cleanup or repair.
>> You're not even talking about snapshots using nodetool snapshot yet. Also
>> nodetool compact does compact all of the live files, however the compacted
>> SSTables will not be cleaned up until a garbage collection is triggered, or
>> a capacity threshold is met.
> Ok, so after a compaction, Cass is still not done with the older sets of .db
> files and I should let Cass delete them?  But, I thought one of the main
> purposes of compaction was to reclaim disk storage resources.  I'm only
> playing around with a small data set so I can't tell how fast the data
> grows.  I'm trying to plan my storage requirements.  Is each newly-generated
> set as large in size as the previous?
> The reason I ask is it seems a snapshot is...
>>> Q1: Should the files with the lower index #'s (under the data/{keyspace}
>>> directory) be manually deleted?  Or, do ALL of the files in this directory
>>> need to be backed-up?
>> Do not ever delete files in your data directory if you care about data on
>> that replica, unless they are from a column family that no longer exists on
>> that server. There may be some duplicate data in the files, but if the files
>> are in the data directory, as a general rule, they are there because they
>> contain some set of data that is in none of the other SSTables.
> ... It seems a snapshot is implemented, unsurprisingly,  as just a link to
> the latest (highest indexed) set; not the previous sets.  So, obviously,
> only the latest *.db files will get backed-up.  Therefore, the previous sets
> must be worthless.

