cassandra-commits mailing list archives

From "Branimir Lambov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
Date Wed, 10 Dec 2014 16:29:13 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241332#comment-14241332 ]

Branimir Lambov commented on CASSANDRA-7032:
--------------------------------------------

Ignoring replication for the time being (more on that below), and looking at the best thing
we can do when we have an existing setup and are trying to add a node, I came up with the
following approach.

We can only assign new vnodes, which means that we can only _take away_ load from other nodes,
never add to it. On the one hand this means that underutilized nodes are hopeless until the
cluster grows enough for their share to become normal. On the other it means that the best
thing to do (aiming for the smallest overutilization, i.e. max deviation from mean) is to
take the highest-load nodes and spread their load evenly between them and the new node.

Adding a new node gives us _vn_ new tokens to issue, where _vn_ is the number of vnodes per
node, i.e. we can decrease the load of at most _vn_ other nodes. We could simply pick the _vn_
highest-load nodes, but some of them may already have a lower load than the target spread; we
thus select the _n <= vn_ highest-load nodes, with _n_ as large as possible, such that the
spread load _t_, their combined load divided by _n+1_, is lower than the load of each individual
node. We can then assign the _vn_ tokens so that they split ranges in these _n_ nodes and reduce
the load of each of them to _t_. This should also leave the new node with a load of _t_.
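
To make that selection step concrete, here is a rough sketch of it (illustrative only -- the
names and structure are mine, not taken from the attached TestVNodeAllocation.java), assuming
per-node loads expressed relative to the mean:
{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SpreadTargetSketch
{
    /**
     * Pick the largest n <= vn highest-load nodes such that their combined load
     * divided by (n + 1) stays below the load of each node picked. The chosen
     * loads are appended to 'selected' and the target load t is returned;
     * splitting ranges in the selected nodes down to t should leave them and
     * the new node all at load t.
     */
    public static double chooseSpreadTarget(List<Double> loads, int vn, List<Double> selected)
    {
        List<Double> sorted = new ArrayList<>(loads);
        sorted.sort(Collections.reverseOrder());    // highest load first

        int bestN = 0;
        double bestT = 0;
        double sum = 0;
        for (int n = 1; n <= Math.min(vn, sorted.size()); ++n)
        {
            sum += sorted.get(n - 1);
            double t = sum / (n + 1);
            // every selected node must currently carry more than the target,
            // since we can only take load away from a node, never add to it
            if (t < sorted.get(n - 1))
            {
                bestN = n;
                bestT = t;
            }
        }
        selected.addAll(sorted.subList(0, bestN));
        return bestT;
    }
}
{code}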

The attached code implements a simple version of this which reduces overutilization very
quickly with every new node -- a typical simulation looks like:
{code}
Random generation of 1000 nodes with 256 tokens each
Size 1000   max 1.24 min 0.80   No replication
Adding 1 node(s) using NoReplicationTokenDistributor
Size 1001   max 1.11 min 0.80   No replication
Adding 9 node(s) using NoReplicationTokenDistributor
Size 1010   max 1.05 min 0.81   No replication
Adding 30 node(s) using NoReplicationTokenDistributor
Size 1040   max 1.02 min 0.83   No replication
Adding 210 node(s) using NoReplicationTokenDistributor
Size 1250   max 1.00 min 1.00   No replication
{code}
It also does pretty well when constructing clusters from scratch.
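
For reference, I read the max/min columns above as the largest and smallest node ownership
relative to the even _1/N_ share. A minimal sketch of how such figures could be computed in
the no-replication case (tokens modelled as doubles in [0, 1) for brevity; not the attached
code):
{code}
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;

public class OwnershipStatsSketch
{
    // tokenToNode maps each token to the node owning the range ending at it
    public static void printStats(SortedMap<Double, Integer> tokenToNode, int nodeCount)
    {
        double[] ownership = new double[nodeCount];
        double prev = tokenToNode.lastKey() - 1.0;          // wrap the last range around the ring
        for (Map.Entry<Double, Integer> e : tokenToNode.entrySet())
        {
            ownership[e.getValue()] += e.getKey() - prev;   // node owns (prev, token]
            prev = e.getKey();
        }
        double mean = 1.0 / nodeCount;
        System.out.printf("Size %d   max %.2f min %.2f   No replication%n",
                          nodeCount,
                          Arrays.stream(ownership).max().getAsDouble() / mean,
                          Arrays.stream(ownership).min().getAsDouble() / mean);
    }
}
{code}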

However, when replication is present the load distribution of this allocation does not look
good (the added node tends to take much more than it should; one reason for this is that it
becomes a replica of the token ranges it splits), which is not unexpected. I am now trying
to see how exactly taking replication into account affects the reasoning above. We can still
only remove load, but the way splitting affects the loads is not that clear any more.

As far as I can see the following simplification of Cassandra's replication strategies should
suffice for handling the current and planned variations:
* we have units made up of a number of vnodes whose load we want to be able to balance (currently
unit==node, but in the future the unit could be smaller (a disk or core))
* units are bunched up in racks (if racks are not defined, a node is implicitly a rack for
its units)
* replicas of data must be placed on the closest higher vnodes that belong to different racks
* the replication strategy specifies the number of replicas and the set of units belonging
to each rack
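
To pin the terms down, an illustrative sketch of this model (the naming is mine, not from any
patch): the allocator only needs the replica count and the rack of each unit, and placement
walks the ring to the closest following vnodes in distinct racks.
{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedMap;

public class ReplicationModelSketch
{
    public static class Unit
    {
        public final String name;   // a node today; possibly a disk or core later
        public final String rack;   // implicitly a one-node rack if racks are not defined
        public Unit(String name, String rack) { this.name = name; this.rack = rack; }
    }

    private final int replicas;                     // replica count within the dc
    private final SortedMap<Double, Unit> ring;     // vnode token -> owning unit

    public ReplicationModelSketch(int replicas, SortedMap<Double, Unit> ring)
    {
        this.replicas = replicas;
        this.ring = ring;
    }

    /**
     * Replicas of the range ending at the given token go to the units owning
     * the closest vnodes at or after it that fall in distinct racks.
     */
    public List<Unit> replicasFor(double token)
    {
        List<Unit> result = new ArrayList<>();
        Set<String> racksUsed = new HashSet<>();
        List<Unit> walk = new ArrayList<>(ring.tailMap(token).values());
        walk.addAll(ring.values());                 // wrap around the ring
        for (Unit u : walk)
        {
            if (racksUsed.add(u.rack))
                result.add(u);
            if (result.size() == replicas)
                break;
        }
        return result;
    }
}
{code}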

Datacentres are irrelevant as replication is specified within each dc, i.e. we can isolate
the vnode allocation to the individual dc. If disk/core-level allocation is in place, the
node boundaries within a rack can be ignored as well. Is there anything I'm missing?


[~benedict]: I believe you prefer to split the disk/core workload inside the node by assigning
each disk/core a token range (e.g. the vnodes that intersect with a range covering _1/n_ of
the token ring are handled by that disk/core). I prefer to just choose _1/n_ of the vnodes,
because it lets me balance them directly -- do you have any objections to this?
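
As a purely illustrative sketch of what I mean by choosing _1/n_ of the vnodes (one possible
way is simply taking every n-th vnode in token order), contrasted with carving out a contiguous
token range per disk/core:
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VnodeSplitSketch
{
    // Assign a node's vnodes to n disks/cores by taking every n-th vnode
    // (one possible way of choosing 1/n of them), instead of carving the
    // node's portion of the ring into n contiguous token ranges.
    public static Map<Integer, List<Long>> splitByVnode(List<Long> sortedNodeTokens, int n)
    {
        Map<Integer, List<Long>> perDisk = new HashMap<>();
        for (int i = 0; i < sortedNodeTokens.size(); ++i)
            perDisk.computeIfAbsent(i % n, d -> new ArrayList<>()).add(sortedNodeTokens.get(i));
        return perDisk;  // each disk gets ~1/n of the vnodes, individually adjustable
    }
}
{code}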


> Improve vnode allocation
> ------------------------
>
>                 Key: CASSANDRA-7032
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Branimir Lambov
>              Labels: performance, vnodes
>             Fix For: 3.0
>
>         Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java
>
>
> It's been known for a little while that random vnode allocation causes hotspots of ownership.
> It should be possible to improve dramatically on this with deterministic allocation. I have
> quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will
> repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also
> ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The
> allocation still permits slight discrepancies in ownership, but it is bound by the inverse
> of the size of the cluster (as opposed to random allocation, which strangely gets worse as
> the cluster size increases). I'm sure there is a decent dynamic programming solution to this
> that would be even better.
> If on joining the ring a new node were to CAS a shared table where a canonical allocation
> of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed
> bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
