cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "MemtableSSTable" by JonathanEllis
Date Mon, 08 Feb 2010 19:44:52 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "MemtableSSTable" page has been changed by JonathanEllis.
http://wiki.apache.org/cassandra/MemtableSSTable

--------------------------------------------------

New page:
Cassandra first appends each write to the !CommitLog, and then applies it to a per-!ColumnFamily
structure called a Memtable.  A Memtable is basically a write-back cache of data rows that can be
looked up by key.  That is, unlike a write-through cache, it batches up writes until it is full,
and only then writes them to disk as an SSTable.
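
As a rough illustration of that write path, here is a minimal in-memory sketch in Python.  It is
not Cassandra's implementation: the class name, the row-count "full" threshold and the use of
plain lists and dicts are all assumptions made for the example.

```python
class ColumnFamilyStore:
    """Toy model of one column family's write path (all names are hypothetical)."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []    # stands in for the append-only CommitLog
        self.memtable = {}      # row key -> {column name: value}
        self.sstables = []      # flushed, immutable snapshots
        self.memtable_limit = memtable_limit  # assumed "full" threshold, in rows

    def write(self, key, column, value):
        # 1. Record the write in the commit log first.
        self.commit_log.append((key, column, value))
        # 2. Apply it to the in-memory Memtable.
        self.memtable.setdefault(key, {})[column] = value
        # 3. Only when the Memtable is full is anything written out as an SSTable.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the Memtable as a key-sorted, immutable tuple of rows.
        rows = sorted(self.memtable.items(), key=lambda item: item[0])
        self.sstables.append(tuple((key, dict(cols)) for key, cols in rows))
        self.memtable = {}
```

With memtable_limit=2, the first write to a new row leaves sstables empty; the write that creates
the second row triggers a single flush containing both rows.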

The process of turning a Memtable into an SSTable is called flushing.  You can manually trigger
a flush via JMX (e.g. with bin/nodetool), which you may want to do before restarting nodes since
it will reduce !CommitLog replay time.  Memtables are sorted by key and then written out sequentially.
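
The "sort by key, then write sequentially" step can be sketched as follows.  The one-JSON-object-
per-line file layout and the function names are assumptions made for illustration; they are not
Cassandra's on-disk SSTable format.

```python
import json


def flush_memtable(memtable, path):
    """Write every row to `path` in key order, in one sequential pass."""
    with open(path, "w", encoding="utf-8") as out:
        for key in sorted(memtable):
            # One row per line; rows are only ever appended, never rewritten.
            out.write(json.dumps({"key": key, "columns": memtable[key]}) + "\n")


def load_sstable(path):
    """Read the rows back in the order they were written (sorted by key)."""
    with open(path, encoding="utf-8") as src:
        return [json.loads(line) for line in src]
```

Because the rows are sorted in memory before the file is opened, the flush itself never has to
seek: it is a single append-only pass over the output file.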

Thus, writes are extremely fast, costing only a commitlog append and an amortized sequential
write for the flush!

Once flushed, SSTable files are immutable; no further writes may be done.  So, on the read
path, the server may have to combine row fragments from all the SSTables on disk, as well as
from any unflushed Memtables, to produce the requested data.  It uses tricks like bloom filters
to avoid consulting SSTables unnecessarily.
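
A minimal sketch of that read path, under stated assumptions: the BloomFilter below is a small
illustrative one built on hashlib, SSTables are modelled as in-memory dicts, and fragments from
newer sources simply overwrite older ones.  That last rule is a simplification chosen for brevity,
and all class and function names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = 0  # an int used as a bitset

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Immutable set of rows plus a bloom filter over its row keys."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def read_row(key, sstables, memtable):
    """Combine row fragments from SSTables (oldest first) and the Memtable."""
    result, found = {}, False
    for table in sstables:
        if not table.bloom.might_contain(key):
            continue  # definitely absent: skip this SSTable entirely
        fragment = table.rows.get(key)
        if fragment is not None:
            result.update(fragment)
            found = True
    if key in memtable:  # unflushed data is the newest, so it is applied last
        result.update(memtable[key])
        found = True
    return result if found else None
```

The bloom filter check is what lets the read skip SSTables that cannot contain the key; a false
positive only costs one wasted lookup, since the row lookup itself is still exact.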

To bound the number of SSTable files that must be consulted on reads, and to reclaim [[DistributedDeletes|space
taken by unused data]], Cassandra performs compactions: merging multiple old SSTable files
into a single new one.  Since the input SSTables are all sorted by key, merging can be done
efficiently, still requiring no random I/O.  Once compaction is finished, the old SSTable
files may be deleted.  Note that in the worst case (a workload consisting of no overwrites
or deletes) the new SSTable is as large as its inputs combined, so compaction will temporarily
require up to 2x the on-disk space your data currently uses.  In today's world of multi-TB disks
this is usually not a problem, but it is good to keep in mind when you are setting alert thresholds.
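
The merge at the heart of compaction can be sketched with Python's heapq.merge, which reads
several already-sorted inputs in a single forward pass.  This is an illustration only: SSTables
are modelled as key-sorted lists of (key, columns) pairs, newer fragments overwrite older ones,
and handling of deletes is omitted.  The function name is hypothetical.

```python
import heapq


def compact(sstables):
    """Merge key-sorted SSTables (listed oldest first) into one key-sorted SSTable."""
    # Tag each row with its SSTable's age so equal keys come out oldest -> newest.
    tagged = [
        [(key, age, cols) for key, cols in table]
        for age, table in enumerate(sstables)
    ]
    merged = []
    for key, _age, cols in heapq.merge(*tagged, key=lambda row: (row[0], row[1])):
        if merged and merged[-1][0] == key:
            merged[-1][1].update(cols)        # same row again: newer columns win
        else:
            merged.append((key, dict(cols)))  # first fragment seen for this row
    return merged
```

If no two inputs share a row key, the output holds every input row unchanged, which is the
no-overwrites worst case: the merged SSTable is as large as all of its inputs together, and both
copies exist until the old files are deleted.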

(The high-level memtable/sstable design as well as the "Memtable" and "SSTable" names come
from sections 5.3 and 5.4 of [[http://labs.google.com/papers/bigtable.html|Google's
Bigtable paper]], although some of the terminology around compaction differs.)
