[ https://issues.apache.org/jira/browse/CASSANDRA7032?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Branimir Lambov updated CASSANDRA7032:

Attachment: TestVNodeAllocation.java
A workinprogress algorithm for selecting vnodes in the replicated case is attached. The
main idea of the algorithm is to select token positions for each new vnode in such a way as
to get best improvement in replicated load variance (i.e. standard deviation) across nodes
and vnodes *1. More specifically, it prepares a selection of token positions to try (by picking
the middle positions between existing vnodes *2), evaluates the expected improvement in variance
for each selection and chooses the best *3, continuing until all the vnodes of the new node
have assigned tokens. To improve average performance, the expected improvement for all choices
is calculated once; for the second and later vnode we only recalculate it for the best candidate
until we find one that does not deteriorate to worse than the next option in the list *4.
Tested with simple factor3 replication, it maintains the following utilization ranges:
 1 vnode: 70%  135%
 4 vnodes: 80%  115%
 16 vnodes: 83  106%
 64 vnodes: 86  103%
 256 vnodes: 87  102%
Unlike random allocation, the overutilization does not grow with the number of nodes, and
a much smaller number of vnodes suffice (4 or 8 vnodes would probably be enough for most usecases).
The underutilization for this algorithm is affected less by the number of vnodes; this is
due to the effect of replication on newly added vnodes: they necessarily have to take the
share of one fewer vnode replica than the vnode they split (regardless of the algorithm we
use, if we add a new node to a large enough perfectly balanced cluster where all vnodes are
responsible for the same share of tokens, the new node will necessarily have at most 2/3 (for
RF=3) of the average load.). This could possibly be improved if we manage to keep enough individual
tokens with load closer to RF / (RF  1), which I've yet to try.
The algorithm is implemented in the {{ReplicationAwareTokenDistributor}} in the attached file.
Running the file simulates the effect of adding nodes using this algorithm on a randomlygenerated
cluster and prints out the minimum and maximum pernode and pertoken replicated load after
each step, as well as the standard deviation of the load. Sample results:
{code}
Random generation of 500 nodes with 8 tokens each
Size 500 node max 1.88 min 0.51 stddev 0.22193
Adding 1 node(s) using ReplicationAwareTokenDistributor
Size 501 node max 1.90 min 0.51 stddev 0.21922 Simple 3 replicas
Adding 4 node(s) using ReplicationAwareTokenDistributor
Size 505 node max 1.63 min 0.51 stddev 0.20580 token max 3.72 min 0.01 stddev 0.58768
Simple 3 replicas
Adding 15 node(s) using ReplicationAwareTokenDistributor
Size 520 node max 1.51 min 0.53 stddev 0.17369 token max 3.83 min 0.01 stddev 0.54526
Simple 3 replicas
Adding 105 node(s) using ReplicationAwareTokenDistributor
Size 625 node max 1.15 min 0.63 stddev 0.08069 token max 2.73 min 0.00 stddev 0.40190
Simple 3 replicas
Adding 375 node(s) using ReplicationAwareTokenDistributor
Size 1000 node max 1.08 min 0.84 stddev 0.03041 token max 1.99 min 0.00 stddev 0.22341
Simple 3 replicas
Losing 1 nodes
Size 999 node max 1.09 min 0.84 stddev 0.03081 token max 1.98 min 0.00 stddev 0.22429
Simple 3 replicas
Adding 1 node(s) using ReplicationAwareTokenDistributor
Size 1000 node max 1.08 min 0.84 stddev 0.03019 token max 1.99 min 0.00 stddev 0.22335
Simple 3 replicas
Losing 5 nodes
Size 995 node max 1.17 min 0.83 stddev 0.03380 token max 2.01 min 0.00 stddev 0.22565
Simple 3 replicas
Adding 5 node(s) using ReplicationAwareTokenDistributor
Size 1000 node max 1.08 min 0.84 stddev 0.03000 token max 1.99 min 0.00 stddev 0.22181
Simple 3 replicas
Losing 20 nodes
Size 980 node max 1.19 min 0.88 stddev 0.04362 token max 2.44 min 0.00 stddev 0.23370
Simple 3 replicas
Adding 20 node(s) using ReplicationAwareTokenDistributor
Size 1000 node max 1.08 min 0.89 stddev 0.02962 token max 1.99 min 0.00 stddev 0.21681
Simple 3 replicas
Losing 125 nodes
Size 875 node max 1.31 min 0.79 stddev 0.08499 token max 2.81 min 0.00 stddev 0.28763
Simple 3 replicas
Adding 125 node(s) using ReplicationAwareTokenDistributor
Size 1000 node max 1.08 min 0.90 stddev 0.02805 token max 1.85 min 0.00 stddev 0.19258
Simple 3 replicas
{code}
This is far from finished as it is much slower than I'd like it to be.
Notes / other things I've tried:
 *1 Only controlling individual vnode load: Because of the replication effect mentioned
above, the ratio between largest and smallest node has to necessarily be at best 3:2 (for
RF=3). If we don't control overall node size, about 30% over/underutilization is the best
we can achieve regardless of vnode count. Moreover, when this is applied to a randomlygenerated
cluster, the first few added nodes tend to significantly increase the node overutilization
in the cluster.
 *1 Only controlling the load of the whole node: This gets better results, but it can enter
pathological states where a large token is never split because that would create an even bigger
new node. The latter happens too often for this option to be useable.
 *2 Picking a choice of e.g. 4 tokens splitting each existing vnode in 1/5 increments does
not appear to give any noticeable improvement over just picking the middle for RF>1.
 *3 The evaluation of the expected improvement needs to include a component for the load
on the new node as well. To be able to assign vnodes independently (otherwise the complexity
of the procedure would go out of control) we use the following heuristic: assume the new node
starts at an optimal size, optimally split between the vnodes, and gradually replace this
ideal view with the actual assigned tokens as they come. An additional weight of filled_vnodes/vnode_count
is applied to the new node to allow for greater flexibility in choosing the first tokens,
which helps avoid the pathological cases.
 *4 This heuristic does not necessarily yield the optimal choice, because assigning a token
changes the expected improvement for the load of the new node for the other candidates and
there is a chance that we will pick a candidate before reaching a better option that has improved
a lot. In practice the heuristic appears to give results that are very close to picking the
actual best candidate.
> Improve vnode allocation
> 
>
> Key: CASSANDRA7032
> URL: https://issues.apache.org/jira/browse/CASSANDRA7032
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Benedict
> Assignee: Branimir Lambov
> Labels: performance, vnodes
> Fix For: 3.0
>
> Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java,
TestVNodeAllocation.java
>
>
> It's been known for a little while that random vnode allocation causes hotspots of ownership.
It should be possible to improve dramatically on this with deterministic allocation. I have
quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will
repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also
ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The
allocation still permits slight discrepancies in ownership, but it is bound by the inverse
of the size of the cluster (as opposed to random allocation, which strangely gets worse as
the cluster size increases). I'm sure there is a decent dynamic programming solution to this
that would be even better.
> If on joining the ring a new node were to CAS a shared table where a canonical allocation
of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed
bounds on the ownership distribution in a cluster. This will also help for CASSANDRA6696.

This message was sent by Atlassian JIRA
(v6.3.4#6332)
