cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jonathan Ellis (JIRA)" <>
Subject [jira] Commented: (CASSANDRA-193) Proactive repair
Date Wed, 10 Jun 2009 04:45:07 GMT


Jonathan Ellis commented on CASSANDRA-193:

To start with the good news: one thing which may seem on the face of it to be a problem, isn't
really.  That is, how do you get nodes replicating a given token range to agree where to freeze
or snapshot the data set to be repaired, in the face of continuing updates?  The answer is,
you don't; it doesn't matter.  If we repair a few columnfamilies that don't really need it
(because one of the nodes was just a bit slower to process an update than the other), that's
no big deal.  We accept that and move on.

The bad news is, I don't see a clever solution for performing broad-based repair against the
Memtable/SSTable model similar to Merkle trees for Dynamo/bdb.  (Of course, that is no guarantee
that none such exists. :)

There are several difficulties.  (In passing, it's worth noting that Bigtable sidesteps these
issues by writing both commit logs and sstables to GFS, which takes care of durability.  Here
we have to do more work in exchange for a simpler model and better performance on ordinary
reads and writes.)

One difficulty lies in how data in one SSTable may be pre-empted by another.  Because of this,
any hash-based "summary" of a row may be obsoleted by rows in another.  For some workloads,
particularly ones in which most keys are updated infrequently, caching such a summary in the
sstable or index file might still be useful, but it should be kept in mind that in the worst
case these will just be wasted effort.

(I think it would be a mistake to address this by forcing a major compaction -- combining
all sstables for the columnfamily into one -- as a prerequisite to repair.  Reading and rewriting
_all_ the data for _each_ repair is a significant amount of extra I/O.)

Another is that token regions do not correspond 1:1 to sstables, because each node is responsible
for N token regions -- the regions for which is is the primary, secondar, tertiary, etc. repository
for -- all intermingled in the SSTable files.  So any precomputation would need to be done
separately N times.

Finally, we can't assume that sstable or even just row key names will fit into the heap, which
limits the kind of in-memory structures we can build.

So from what I do not think it is worth the complexity to attempt to cache per-row hashes
or summaries of the sstable data in the sstable or index files.

So the approach I propose is simply to iterate through the key space on a per-CF basis, compute
a hash, and repair if there is a mismatch.  The code to iterate keys is already there (for
the compaction code) and so is the code to compute hashes and repair if a mismatch is found
(for read repair).  I think it will be worth flushing the current memtable first to avoid
having to take a read lock on it.

Enhancements could include building a merkle tree from each batch of hashes to minimize round
trips -- although unfortunately I think that is not going to be a bottleneck for Cassandra
compared to the hash computation -- and fixing the compaction and hash computation code to
iterate through columns in a CF rather than deserializing each ColumnFamily in its entirety.
 These could definitely be split into separate tickets.

> Proactive repair
> ----------------
>                 Key: CASSANDRA-193
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jonathan Ellis
>             Fix For: 0.4
> Currently cassandra supports "read repair," i.e., lazy repair when a read is done.  This
is better than nothing but is not sufficient for some cases (e.g. catastrophic node failure
where you need to rebuild all of a node's data on a new machine).
> Dynamo uses merkle trees here.  This is harder for Cassandra given the CF data model
but I suppose we could just hash the serialized CF value.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message