cassandra-commits mailing list archives

From "Branimir Lambov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-7032) Improve vnode allocation
Date Fri, 06 Feb 2015 16:23:45 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309383#comment-14309383 ]

Branimir Lambov edited comment on CASSANDRA-7032 at 2/6/15 4:23 PM:
--------------------------------------------------------------------

The answer relies on a couple of mappings:
- a Cassandra data centre is a separate ring to the algorithm. Data centres have their own
allocation of replicas, hence one can only balance load within them, not across multiple DCs.
The algorithm can simply ignore everything outside the data centre where the new node is spawned.
- in current Cassandra terms, a node is a node to the algorithm.
- in future Cassandra terms, a disk is a node to the algorithm*.
- a Cassandra rack is a rack to the algorithm if the number of racks within the data centre
is equal to or higher than the replication factor; otherwise each Cassandra node is a rack of
the algorithm. Implicitly, all disks in a machine belong to the same rack, so the machine
level can be ignored entirely once racks are defined.
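The mapping rules above can be sketched as a small helper. This is an illustrative sketch, not Cassandra code: the class and method names are hypothetical, and it only captures the rack-selection rule (Cassandra racks are used when there are at least RF of them in the data centre; otherwise each node acts as its own rack).

```java
public class TopologyMapping {
    /**
     * Returns the identifier of the "rack" as seen by the allocation
     * algorithm. If the data centre contains at least replicationFactor
     * distinct Cassandra racks, the Cassandra rack is used directly;
     * otherwise each node acts as its own rack.
     */
    static String algorithmRack(String cassandraRack, String nodeId,
                                int distinctRacksInDc, int replicationFactor) {
        return distinctRacksInDc >= replicationFactor ? cassandraRack : nodeId;
    }

    public static void main(String[] args) {
        // DC with 2 racks but RF = 3: racks collapse to per-node granularity.
        System.out.println(algorithmRack("rack1", "node-a", 2, 3));
        // DC with 3 racks and RF = 3: Cassandra racks map directly.
        System.out.println(algorithmRack("rack1", "node-a", 3, 3));
    }
}
```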

The only fine point is the starred item. I know your preference is to split the token space
into a single range per disk. I see little benefit in this compared to simply assigning a
subset of the machine's vnodes to each disk. The latter is more flexible, and perhaps even
easier to achieve when SSTables are per vnode. (What may not be obvious is that each vnode is
always responsible for replicating a contiguous range of tokens, and it is easy to convert
between a vnode and the token range it serves; see {{ReplicationStrategy.replicationStart}} in
the attached.) Slower disks take a small number of vnodes (4?), faster ones take more (8/16?).
Bigger machines have more disks, hence more vnodes, and automatically take a higher load.
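Both points of the paragraph can be sketched together: the vnode-to-range conversion, and a weighted round-robin assignment of a machine's vnodes to disks so that faster disks carry more. This is a standalone sketch under assumed names (it is not the attached {{TestVNodeAllocation.java}}); real Cassandra tokens and replication placement are more involved.

```java
import java.util.*;

public class VnodeDiskAssignment {
    /** The contiguous range a vnode serves: (predecessor token, own token]. */
    static long[] tokenRange(TreeSet<Long> ring, long vnodeToken) {
        Long prev = ring.lower(vnodeToken);
        if (prev == null) prev = ring.last(); // the ring wraps around
        return new long[] { prev, vnodeToken };
    }

    /** Weighted round-robin: disks with higher weight get more vnodes. */
    static Map<String, List<Long>> assignVnodes(List<Long> vnodeTokens,
                                                Map<String, Integer> diskWeights) {
        // Expand each disk by its weight so faster disks appear more often.
        List<String> slots = new ArrayList<>();
        diskWeights.forEach((disk, w) -> {
            for (int i = 0; i < w; i++) slots.add(disk);
        });
        Map<String, List<Long>> result = new LinkedHashMap<>();
        for (int i = 0; i < vnodeTokens.size(); i++)
            result.computeIfAbsent(slots.get(i % slots.size()),
                                   d -> new ArrayList<>())
                  .add(vnodeTokens.get(i));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("slow-disk", 1); // e.g. spinning disk
        weights.put("fast-disk", 2); // e.g. SSD, takes twice the vnodes
        List<Long> tokens = new ArrayList<>();
        for (long t = 0; t < 12; t++) tokens.add(t * 1000L);
        assignVnodes(tokens, weights).forEach(
            (disk, ts) -> System.out.println(disk + " -> " + ts.size() + " vnodes"));
    }
}
```

With 12 vnodes and weights 1:2, the slow disk ends up with 4 vnodes and the fast disk with 8, matching the "4 vs 8/16" proportions suggested above.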

The algorithm handles heterogeneous rack and node configuration without issues as long as
the number of distinct racks (in algorithm terms) is higher than the replication factor. 



> Improve vnode allocation
> ------------------------
>
>                 Key: CASSANDRA-7032
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Branimir Lambov
>              Labels: performance, vnodes
>             Fix For: 3.0
>
>         Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java
>
>
> It's been known for a little while that random vnode allocation causes hotspots of ownership.
> It should be possible to improve dramatically on this with deterministic allocation. I have
> quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will
> repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also
> ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The
> allocation still permits slight discrepancies in ownership, but it is bound by the inverse
> of the size of the cluster (as opposed to random allocation, which strangely gets worse as
> the cluster size increases). I'm sure there is a decent dynamic programming solution to this
> that would be even better.
> If on joining the ring a new node were to CAS a shared table where a canonical allocation
> of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed
> bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696.
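The hotspot claim in the quoted description is easy to demonstrate with a toy simulation. The sketch below (illustrative only, unrelated to the attached test harness) places tokens on a unit ring and reports the largest contiguous range relative to the average: perfectly even spacing gives a ratio of 1.0, while random placement produces ranges several times larger than average.

```java
import java.util.*;

public class OwnershipImbalance {
    /**
     * Max-over-average ownership of the contiguous ranges induced by the
     * given token positions on a unit ring [0, 1). A ratio of 1.0 means a
     * perfectly balanced ring; larger values indicate hotspots.
     */
    static double maxOverAverage(double[] tokens) {
        double[] sorted = tokens.clone();
        Arrays.sort(sorted);
        double max = 0;
        for (int i = 0; i < sorted.length; i++) {
            // Range owned by token i: distance from its predecessor,
            // wrapping around the ring for the first token.
            double prev = (i == 0) ? sorted[sorted.length - 1] - 1.0 : sorted[i - 1];
            max = Math.max(max, sorted[i] - prev);
        }
        return max * sorted.length; // average range size is 1/n
    }

    public static void main(String[] args) {
        int n = 256;
        double[] tokens = new double[n];

        Random rnd = new Random(42); // fixed seed for reproducibility
        for (int i = 0; i < n; i++) tokens[i] = rnd.nextDouble();
        System.out.printf("random: max/avg = %.2f%n", maxOverAverage(tokens));

        for (int i = 0; i < n; i++) tokens[i] = (double) i / n;
        System.out.printf("even:   max/avg = %.2f%n", maxOverAverage(tokens)); // 1.00
    }
}
```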



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
