Do we have a problem with bootstrapping nodes not being aware of each
other under the rack-aware replication strategy?
Background: bootstrap makes the assumption that we can simplify things
by treating bootstrap of multiple nodes independently, trading some
(potential) extra copying for simplifying the process for recovery if
a node fails or is killed during the bootstrap process.
A couple of examples should illustrate this.
Suppose we have nodes A and D in rack unaware mode, replication factor
of one (for simplicity). The ranges are then (DA] for A and (AD]
for D.
Nodes B and C then bootstrap between A and D. So we copy (AB] to B
and (AC] to C. If both bootstraps complete successfully then they
will serve (AB] and (BC], that is, we transferred (AB] to C
unnecessarily. But, if either bootstrap fails, the remaining
bootstrap can ignore that and serve the entire range that was
transferred to it.
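The rack-unaware example can be sketched in a few lines of Python. This is only a toy model, not Cassandra code: it assumes single-letter tokens ordered alphabetically, and the helper names (`primary_range`, `segments`, `bootstrap_in_isolation`) are made up for illustration. Each range is split at every token of the final ring, and each piece is named by its right endpoint, so "B" stands for (AB] and "C" for (BC].

```python
def primary_range(node, ring):
    """Return (predecessor, node), i.e. the range (predecessor, node]."""
    ring = sorted(ring)
    return ring[ring.index(node) - 1], node


def segments(start, end, tokens):
    """Split the range (start, end] at every token in `tokens`.

    Each piece (prev, t] is named by its right endpoint t.
    """
    tokens = sorted(tokens)
    i = tokens.index(start)
    covered = set()
    while True:
        i = (i + 1) % len(tokens)
        covered.add(tokens[i])
        if tokens[i] == end:
            return covered


def bootstrap_in_isolation(existing, joining):
    """For each joining node, return (copied, served) as sets of segments."""
    final_ring = existing + joining
    result = {}
    for node in joining:
        # The joining node only knows about the existing ring plus itself.
        predicted = primary_range(node, existing + [node])
        actual = primary_range(node, final_ring)
        result[node] = (segments(*predicted, final_ring),
                        segments(*actual, final_ring))
    return result


for node, (copied, served) in bootstrap_in_isolation(["A", "D"], ["B", "C"]).items():
    print(node, "copied", sorted(copied), "serves", sorted(served),
          "extra", sorted(copied - served))
```

In this model C copies the pieces "B" and "C" but only serves "C", which is the unnecessary (AB] transfer described above; B copies exactly what it serves.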
So for rack-unaware bootstrapping it is clear that
bootstrap-in-isolation is fine. But what about rack-aware?
Recall that in rack-aware mode, we write the first replica to the
first node on the ring _in the other data center_, and the remaining
replicas to nodes in the same data center as the primary.
Say we have two nodes A and D, in different DCs, with a replication
factor of 2:
A / D
  Node  Primary range  Replica for
  A     (DA]           (AD]
  D     (AD]           (DA]
If we add nodes B and C in the same DCs as A and D, respectively, we
bootstrap as
A,B / C,D
B predicts the ring will be
  Node  Primary range  Replica for
  A     (DA]           (BD]
  B     (AB]
  D     (BD]           (DA], (AB]
C predicts
  Node  Primary range  Replica for
  A     (DA]           (AC], (CD]
  C     (AC]           (DA]
  D     (CD]
And really we end up with
  Node  Primary range  Replica for
  A     (DA]           (BC], (CD]
  B     (AB]
  C     (BC]           (DA], (AB]
  D     (CD]
So each node does have (a superset of) the right data copied. (Note
that C has (AB] as a replica in the final version, whereas it
predicted it would be part of its primary range, but that doesn't
matter as long as it ended up with the right data on it.)
If instead we add B and C both to D's datacenter we have:
A / B,C,D
B predicts
  Node  Primary range  Replica for
  A     (DA]           (AB], (BD]
  B     (AB]           (DA]
  D     (BD]
C predicts
  Node  Primary range  Replica for
  A     (DA]           (AC], (CD]
  C     (AC]           (DA]
  D     (CD]
And really we end up with
  Node  Primary range  Replica for
  A     (DA]           (AB], (BC], (CD]
  B     (AB]           (DA]
  C     (BC]
  D     (CD]
Again each node ends up with the right data.
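Both rack-aware layouts can be checked mechanically with a small Python sketch. It is a simplified model of the placement rule as described here, not Cassandra's implementation: it assumes a replication factor of 2 only, single-letter tokens ordered alphabetically, and hypothetical helper names (`segments`, `holdings`, `check`). Ranges are split at every token of the final ring and each piece is named by its right endpoint, so "A" is (DA], "B" is (AB], and so on.

```python
def segments(start, end, tokens):
    """Split the range (start, end] at every token; name pieces by right endpoint."""
    tokens = sorted(tokens)
    i = tokens.index(start)
    covered = set()
    while True:
        i = (i + 1) % len(tokens)
        covered.add(tokens[i])
        if tokens[i] == end:
            return covered


def holdings(ring, dc, tokens):
    """Map each node to the pieces it stores (primary plus replica) at RF=2."""
    ring = sorted(ring)
    held = {node: set() for node in ring}
    for i, node in enumerate(ring):
        primary = segments(ring[i - 1], node, tokens)
        held[node] |= primary
        # Second replica: first node clockwise that is in another data center.
        for step in range(1, len(ring)):
            other = ring[(i + step) % len(ring)]
            if dc[other] != dc[node]:
                held[other] |= primary
                break
    return held


def check(existing, joining, dc):
    """Return {node: (predicted, final)} holdings for each joining node."""
    final_ring = existing + joining
    final = holdings(final_ring, dc, final_ring)
    return {node: (holdings(existing + [node], dc, final_ring)[node], final[node])
            for node in joining}


layouts = {
    "A,B / C,D": {"A": 1, "B": 1, "C": 2, "D": 2},
    "A / B,C,D": {"A": 1, "B": 2, "C": 2, "D": 2},
}
for name, dc in layouts.items():
    for node, (predicted, final) in check(["A", "D"], ["B", "C"], dc).items():
        print(name, node, "copied", sorted(predicted), "needs", sorted(final),
              "ok" if predicted >= final else "MISSING DATA")
```

Under this model every joining node's predicted holdings are a superset of what it needs in the final ring for both layouts. That only confirms the two worked examples; it is not a proof for arbitrary rings or replication factors.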
Are there conditions under which we don't?
After playing around with this in my mind I think that there are not,
but this is tricky so peer review is welcome. :)
