incubator-cassandra-dev mailing list archives

From Jaakko <>
Subject moving & pending ranges & repair
Date Fri, 20 Nov 2009 13:37:27 GMT
Hi all,

I've been playing around quite a lot today with simultaneous
bootstraps and TokenMetadata, getting the hang of how things work /
should work. I think the mechanism will need some refactoring for 0.9
in order to support automatic load balancing. As it stands now, things
get very tricky the moment more than one node is on the move
simultaneously, and we would probably shoot ourselves in the foot if
we did this automatically now. There always seems to be a loophole
through which a particularly bizarre and badly timed network partition
can squeeze and leave the cluster inconsistent. Anyway, I'll continue
to work on this and we'll see what we have time to do within 0.5.

I'm not very familiar with Cassandra's actual storage mechanism
(compaction and data streaming especially), so I'm not sure which
scenarios we absolutely have to avoid and which can be cleaned up
afterward. A couple of related questions are on my mind:

Is there a mechanism to delete records that are no longer in our
range? That is, when we're moving, updating pending ranges is not
atomic, and writes might go only to the "old" destinations and not to
the ones indicated by pending ranges. When a node's movement is over,
it is quite possible that it does not have the newest data for its
range. I suppose we rely on proactive repair to get the records we
missed during the movement, but what about data a node might hold
outside its range? When we leave a range and delete the data in it, it
takes some time before that information propagates to all nodes, and
writes during this window will still be directed to the leaving node.
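To make the non-atomic window concrete, here is a toy Python model (not
Cassandra's actual code; the ring layout and all names are made up for
illustration) of write routing during a bootstrap: the coordinator writes
to the current natural endpoints plus any node whose pending range covers
the key. While some coordinators have seen a pending range and others have
not, the two groups pick different target sets, which is exactly the gap
repair has to cover.

```python
from bisect import bisect_left

# Ring: sorted list of (token, node). Pending: {claimed_token: joining_node}.
# Ring wraparound of pending ranges is ignored to keep the sketch short.

def natural_endpoints(ring, token, rf):
    """First rf distinct nodes at or after `token`, wrapping around the ring."""
    tokens = [t for t, _ in ring]
    i = bisect_left(tokens, token)
    result = []
    while len(result) < rf:
        node = ring[i % len(ring)][1]
        if node not in result:
            result.append(node)
        i += 1
    return result

def write_endpoints(ring, pending, token, rf):
    """Natural endpoints plus pending owners whose claimed range covers token."""
    targets = list(natural_endpoints(ring, token, rf))
    tokens = [t for t, _ in ring]
    for new_token, node in pending.items():
        # the joining node will own (prev_token, new_token]
        i = bisect_left(tokens, new_token)
        prev_token = tokens[i - 1]
        if prev_token < token <= new_token and node not in targets:
            targets.append(node)
    return targets

ring = [(10, "A"), (20, "B"), (30, "C")]
pending = {25: "D"}  # D is bootstrapping toward token 25, pending range (20, 25]
print(write_endpoints(ring, pending, 22, rf=1))  # both C (current) and D (pending)
print(write_endpoints(ring, {}, 22, rf=1))       # a coordinator unaware of D: C only
```

A coordinator that has not yet learned of D's pending range sends the
write only to C, so after D finishes its move it may lack that write.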

Another question is about replication factor 1: if a node moves when
the replication factor is one, how do we make sure it has all the
newest data when it finishes the movement? Does table streaming ensure
that all data arriving during the streaming also ends up on the
destination node? As mentioned above, pending ranges are not
foolproof, so the only party that can make this guarantee is the one
doing the streaming; but in that case, what is the purpose of pending
ranges?
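A toy timeline (pure illustration, not Cassandra's streaming code) of the
race this question is about: with replication factor 1, a write that lands
on the old owner after the streaming snapshot is taken, but before pending
ranges start routing it to the new owner, ends up on neither node's final
copy.

```python
# Dicts stand in for each node's data for the range being transferred.
old_owner = {"k1": "v1"}
snapshot = dict(old_owner)    # stream is built from a point-in-time snapshot
old_owner["k2"] = "v2"        # write arrives during streaming, routed to old owner
new_owner = dict(snapshot)    # streaming completes; new owner has the snapshot only
old_owner.clear()             # old owner drops the transferred range

print("k2" in new_owner, "k2" in old_owner)  # the write survives nowhere
```

With RF > 1 another replica would still hold k2 and repair could recover
it; with RF = 1 there is no other copy, which is why the streaming side
(or the pending-range routing) has to close this window itself.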

One more question about pending ranges (related to the above): what
actually are the requirements for pending ranges? It is next to
impossible to make them match the actual node movement 100%, so we
must rely on repair functionality to some extent in any case. The
question is: are pending ranges more of an optimization / best effort
to leave less work for the repair department, or do they need to make
some guarantees about write destinations while a node is moving? For
example:

- "It is OK to keep writing to the node leaving a range for too long,
  but not a single write may be missed by the node that is assuming
  the range."
- "It is OK for a pending range to be too large, but it must never be
  smaller than the range the node finally assumes after bootstrap (the
  final range may change due to other nodes moving nearby at the same
  time)."

...and other similar restrictions.

