Hi,
Current implementation of loadbalance seems to work only for
RackUnaware. Problem is only RackUnaware creates a "flat" replica
space where all nodes are equal. This is not true for other
strategies, since due to rack and DC considerations, replicas are not
evenly distributed among nodes. To illustrate the problem, let us
consider the following scenario:
 cluster with nodes A through H. Nodes B and G are in DC1, rest of
the nodes in DC2.
 DC shard strategy, factor 3 (without loss of generality we can omit
rack considerations from this discussion).
In this situation ranges would be (node, primary, replicas):
A: HA, FG
B: AB, HA
C: BC, HA, AB
D: CD, BC
E: DE, CD
F: EF, DE
G: FG, AB, BC, CD, DE, EF
H: GH, EF, FG
Now in this situation most likely node G is by far the most loaded
one, so if a node bootstraps (either a new one, or a loadbalance
operation), it will take half of G's range. Problem is, it will take
half of G's _primary_ range, but most of G's load comes from
_replicas_. After this operation, the ring would be (X denotes the new
node):
A: HA, FG
B: AB, HA
C: BC, HA, AB
D: CD, BC
E: DE, CD
F: EF, DE
X: FX, AB, BC, CD, DE, EF
G: XG, FX
H: GH, EF, FX, XG
It is clear from this, that the situation has not really improved. The
only difference is that X is now the most loaded node and G probably
has a very light load. If another new node arrives, it will again go
in front of X, but the situation will remain largely the same.
In order to get rid of such replica sinks, nodes would need to
consider also replica "categories". When we're looking for replica
destinations, we essentially consider categories "in other DC", "in
other rack" and "anywhere". When node X boots in the ring above, it
should not just consider what is G's primary range, but what is G's
effective range (primary plus replicas). Amount of replicas is
determined largely by nodes that belong to the same replica category.
If X belongs in DC1 (same as B and G), best balance would be gained if
X booted in the middle of B and G, as that would divide replicas
evenly. This might not always be the best place, because individual
ranges might be very much different.
In order to fix this, the ideal solution would be to modify load
string so that, instead of total load, it reports both load from
primary range and load from replicas. This would allow bootstrapping
node to decide whether it should take half of replicas or half of the
primary range in order to get optimal result. However, there is no way
to get these two numbers, so we only have total load number. It would
not be perfect, but perhaps for now it would be best to only consider
nodes from the same category when making load balancing decisions.
That is, for rack unaware we consider all nodes as always, but for
other strategies we would determine the bootstrap token based on which
nodes are in the same category. Don't know if this would work, but
should be at least better than now.
Another related issue is: now that we have strategy per table, how
should we approach load balancing? Optimal decision for one strategy
might be bad for another strategy. If we have just one strategy in
use, that's clear, but for multiple strategies we'd need to determine
which one to favor.
Or am I thinking about this in a completely wrong way?
Jaakko
