cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-2815) Bad timing in repair can transfer data it is not supposed to
Date Thu, 23 Jun 2011 07:53:47 GMT
Bad timing in repair can transfer data it is not supposed to 
------------------------------------------------------------

                 Key: CASSANDRA-2815
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2815
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.8.0, 0.7.0
            Reporter: Sylvain Lebresne
            Assignee: Sylvain Lebresne


The core of the problem is that the sstables used to construct a merkle tree are not necessarily
the same as the ones for which streaming is initiated. This is usually not a big deal: newly
compacted sstables don't matter since the data hasn't changed. Newly flushed data (between the
start of the validation compaction and the start of the streaming) is a little more problematic,
even though one could argue that on average this won't amount to much data.

But there can be a more problematic scenario: suppose a 3-node cluster with RF=3, where all the data
on node3 is removed and then repair is started on node3. Also suppose the cluster has two
CFs, cf1 and cf2, sharing the data evenly.
Node3 will request the usual merkle trees; to simplify, let's pretend validation compaction is
single-threaded.
What can happen is the following:
  # node3 computes its merkle trees for all requests very quickly, having no data. It will
thus wait on the other nodes' trees (its own trees saying "I have no data").
  # node1 starts computing its tree for, say, cf1 (queuing the computation for cf2). In the
meantime, node2 may start computing the tree for cf2 (queuing the one for cf1).
  # when node1 completes its first tree, it sends it to node3. Node3 receives it, compares it
to its own tree and initiates the transfer of all the data for cf1 from node1 to itself.
  # not too long after that, node2 completes its first tree, the one for cf2, and sends it
to node3. Based on it, the transfer of all the data for cf2 from node2 to node3 starts.
  # an arbitrarily long time after that (validation compaction can take time), node2 will finish
its second tree (the one for cf1) and send it back to node3. Node3 will compare it to its own
(empty) tree and decide that all the ranges should be repaired with node2 for cf1. The problem
is that by the time that happens, the transfer of cf1 from node1 may already have completed, or at
least partly completed. For that reason, node3 will start streaming all that data to node2.
  # for the same reasons, node3 may end up transferring all or part of cf2 to node1.

So the problem is that even though node1 and node2 may be perfectly consistent at the beginning,
we will end up streaming a potentially huge amount of data to them.
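
To make the race concrete, here is a minimal sketch of the "act on each tree as it arrives"
handling described above. All names (EagerTreeHandler, onTreeReceived, modelling a tree as the
set of its non-empty ranges) are purely illustrative, not the actual repair code:

{code:java}
import java.util.*;

// Illustrative sketch only: each remote tree is compared to the local tree as soon as it
// arrives, and streaming is scheduled immediately, without waiting for the trees of the
// other replicas for the same column family.
class EagerTreeHandler {
    // A "tree" is modelled here as the set of ranges that contain data, which is enough
    // to show the ordering problem: node3's local trees are empty.
    private final Map<String, Set<String>> localTrees; // per column family

    EagerTreeHandler(Map<String, Set<String>> localTrees) {
        this.localTrees = localTrees;
    }

    void onTreeReceived(String remote, String cf, Set<String> remoteTree) {
        Set<String> differing = new HashSet<>(remoteTree);
        differing.removeAll(localTrees.getOrDefault(cf, Collections.emptySet()));
        if (!differing.isEmpty()) {
            // Streaming starts right away. If another replica already transferred cf to this
            // node, that freshly received data is streamed straight back out to 'remote',
            // even though the replicas were consistent to begin with.
            System.out.printf("repair %s with %s on ranges %s%n", cf, remote, differing);
        }
    }
}
{code}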

I think this affects both 0.7 and 0.8, though differently, because compaction multi-threading
and the fact that 0.8 differentiates the ranges make the timing different.

One solution (in a way, the theoretically right solution) would be to grab references to the
sstables we use for the validation compaction, and only initiate streaming on those sstables.
However, in 0.8, where compaction is multi-threaded, this would mean retaining compacted sstables
longer than necessary. This is also a bit complicated and in particular would require a network
protocol change (because a streaming request message would have to contain some information
to decide which set of sstables to use).
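
A rough sketch of what that first option could look like, assuming some reference-counted
sstable abstraction; the names (PinnedValidation, SSTable, acquireReference/releaseReference)
are illustrative and not the 0.7/0.8 API:

{code:java}
import java.util.*;

// Illustrative sketch only: pin the exact sstables read during the validation compaction
// and later stream only from that same set.
class PinnedValidation {
    private final List<SSTable> pinned = new ArrayList<>();

    // Take references at the start of the validation compaction so compaction cannot
    // delete these files before streaming is done.
    void pin(Collection<SSTable> current) {
        for (SSTable s : current) {
            if (s.acquireReference()) {
                pinned.add(s);
            }
        }
    }

    // Stream strictly from the pinned set (the streaming request would need to identify
    // this set, hence the protocol change mentioned above).
    List<SSTable> sstablesToStream() {
        return Collections.unmodifiableList(pinned);
    }

    // Drop the references once the repair session is over, allowing compacted sstables
    // to finally be deleted.
    void release() {
        for (SSTable s : pinned) {
            s.releaseReference();
        }
        pinned.clear();
    }

    interface SSTable {
        boolean acquireReference();
        void releaseReference();
    }
}
{code}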

A perhaps simpler solution would be to have the node coordinating the repair wait for the trees
from all the remote nodes (obviously only for a given column family and range) before starting
any streaming.
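
A minimal sketch of that second option, with purely illustrative names: trees for a given column
family and range are buffered until every expected node has answered, and only then are
differences computed and streaming scheduled.

{code:java}
import java.util.*;

// Illustrative sketch only: buffer the trees from every expected node for one
// (column family, range) and release them all at once.
class WaitForAllTrees<T> {
    private final Set<String> expected;              // all nodes a tree was requested from
    private final Map<String, T> received = new HashMap<>();

    WaitForAllTrees(Set<String> expectedNodes) {
        this.expected = expectedNodes;
    }

    // Returns the complete set of trees once the last one arrives, null otherwise.
    synchronized Map<String, T> onTree(String node, T tree) {
        received.put(node, tree);
        if (received.keySet().containsAll(expected)) {
            // Only now are the trees differenced and streaming scheduled, so no transfer
            // can complete before a later tree is taken into account.
            return Collections.unmodifiableMap(received);
        }
        return null;
    }
}
{code}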


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
