cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
Date Thu, 02 Apr 2015 16:29:54 GMT


Benedict commented on CASSANDRA-7032:

I've pushed a version of this [here|]
which has a lot of comments and some small refactors performed, that I needed to perform in
order to understand the code.

On the whole it seems like a good approach, and carefully optimised.

I think some further changes are necessary for improved understanding, as the functions are
currently very dense, with a focus on performance even in areas that won't make a tremendous
difference. Algorithmically it's good, but we should IMO preferably separate the temporary
state maintenance and cleanup from the main goal steps, and avoid inflating every nested property
access into its own variable, as it pollutes the method body and makes it hard to track which
actions are important. Inside {{evaluateImprovement}} the ownership adjustment calculations
should all preferably be split out into separate methods, and if possible the {{unitsChain}}
there and {{prevPopulate}} chain in {{populateTokenInfo}} should use a class like {{ReplicationVisitor}}
that I introduced to extract some of the dense state maintenance and make the logic more declarative.

I'm still a little confused by the {{expandable}} property, both because I don't understand
its name, but also because its definition doesn't seem to match up to how it's used (it's
quite likely I'm misunderstanding though, so if you could consider my code comment, and explain
- preferably with clearer comments around its stated aim)

I think there is a potential small modification to be made as well, which may or may not be
helpful. It bothers me that we can in theory create very suboptimal chains for establishing
the owners of a token, by having a bunching of groups within the same replication range. Since
these should _generally_ lead to worse distribution outcomes this algorithm should generally
fight against them, but it is possible to find problematic local minima. It would be nice
to pre-exclude any candidates that would lead to an increased clumping of groups. Perhaps
even to exclude any tokens that would shadow the ownership range of a partner (same group),
if we have more than RF groups, and otherwise permitting a maximum traversal of one partner
to define the replication range.

> Improve vnode allocation
> ------------------------
>                 Key: CASSANDRA-7032
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Branimir Lambov
>              Labels: performance, vnodes
>             Fix For: 3.0
>         Attachments:,,,,,
> It's been known for a little while that random vnode allocation causes hotspots of ownership.
It should be possible to improve dramatically on this with deterministic allocation. I have
quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will
repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also
ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The
allocation still permits slight discrepancies in ownership, but it is bound by the inverse
of the size of the cluster (as opposed to random allocation, which strangely gets worse as
the cluster size increases). I'm sure there is a decent dynamic programming solution to this
that would be even better.
> If on joining the ring a new node were to CAS a shared table where a canonical allocation
of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed
bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696.

This message was sent by Atlassian JIRA

View raw message