geode-dev mailing list archives

From Nick Reich
Subject [PROPOSAL]: concurrent bucket moves during rebalance
Date Thu, 08 Mar 2018 19:25:50 GMT

Users have often noted that rebalancing a Geode cluster takes too long.
Currently, buckets are moved one at a time. We propose a system that moves
buckets in parallel, which could greatly reduce the time a rebalance takes.

Parallelization was previously implemented for adding redundant copies of
buckets to restore redundancy. Moving buckets is more complicated, however,
and requires a different approach, because a member could be both gaining
and giving away buckets at the same time. While giving away a bucket, a
member still holds all of that bucket's data until the receiving member has
fully received the bucket and it can safely be removed from the original
owner. Unless the member has enough spare memory to hold all of the buckets
it will receive in addition to all of the buckets it started with, moving
buckets in parallel could cause the member to run out of memory.
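
To make the memory concern concrete, here is a rough sketch in Python (not
Geode code). It assumes equal-sized buckets, so bucket counts stand in for
memory, and it takes the worst case in which every incoming copy arrives
before any outgoing bucket is removed. The function names and the numbers
are made up for illustration.

```python
def peak_buckets(initial, incoming):
    """Worst-case bucket count on a member while its moves are in flight.

    Outgoing buckets are deliberately ignored: a bucket being given away
    stays on the source until the receiver holds a full copy, so it frees
    no memory until its move has completed.
    """
    return initial + incoming


def final_buckets(initial, incoming, outgoing):
    """Bucket count on the member once all of its moves have completed."""
    return initial + incoming - outgoing


# A member that gives away 4 buckets and receives 4 ends where it started,
# but may briefly need room for 4 extra buckets along the way.
print(peak_buckets(10, 4), final_buckets(10, 4, 4))  # 14 10

# A member that only gives away buckets never exceeds its starting count.
print(peak_buckets(10, 0), final_buckets(10, 0, 4))  # 10 6
```

Under these assumptions, a member that only gives or only receives never
holds more than the larger of its starting and ending bucket counts.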

For this reason, we propose a system that does (potentially) several rounds
of concurrent bucket moves:
1) Calculate a set of moves that improves balance and meets the requirement
that no member both receives and gives away a bucket (so no member carries
the memory overhead of both an existing bucket it is ultimately removing and
a new bucket it is receiving).
2) Conduct all calculated bucket moves in parallel. Parameters to throttle
this process should be added so that it does not consume too many cluster
resources and hurt performance, such as a maximum number of buckets that
each member may send or receive concurrently.
3) If the cluster is not yet balanced, perform additional iterations of
calculating and conducting bucket moves until balance is achieved or an
optional maximum number of iterations is reached.
Note: in both the existing and proposed system, regions are rebalanced one
at a time.
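
The three steps above could be sketched roughly as follows. This is an
illustrative Python model, not Geode code and not a proposed
implementation. It assumes a single region, equal-sized buckets, and a
cluster described only by a bucket count per member; it ignores redundancy
and other placement constraints. The names plan_round, rebalance,
max_per_member and max_rounds are hypothetical, and the throttle is
modelled simply as a cap on how many buckets a member may send or receive
in one round.

```python
from collections import Counter


def plan_round(load, max_per_member):
    """Step 1: plan (source, target) moves, one bucket per move.

    A member that has been chosen as a source in this round is never chosen
    as a target, and vice versa.
    """
    projected = dict(load)
    gives, receives = Counter(), Counter()
    moves = []
    while True:
        sources = [m for m in projected
                   if m not in receives and gives[m] < max_per_member]
        targets = [m for m in projected
                   if m not in gives and receives[m] < max_per_member]
        if not sources or not targets:
            break
        src = max(sources, key=lambda m: projected[m])
        dst = min(targets, key=lambda m: projected[m])
        if projected[src] - projected[dst] < 2:
            break  # another move would not improve balance
        projected[src] -= 1
        projected[dst] += 1
        gives[src] += 1
        receives[dst] += 1
        moves.append((src, dst))
    return moves


def rebalance(load, max_per_member=2, max_rounds=10):
    """Steps 2 and 3: apply each round's moves, repeat until no moves remain."""
    load = dict(load)
    rounds = []
    for _ in range(max_rounds):
        moves = plan_round(load, max_per_member)
        if not moves:
            break
        # A real system would run these moves in parallel; the model only
        # updates the bucket counts.
        for src, dst in moves:
            load[src] -= 1
            load[dst] += 1
        rounds.append(moves)
    return load, rounds


final, rounds = rebalance({"a": 6, "b": 0, "c": 0})
print(final)        # {'a': 2, 'b': 2, 'c': 2}
print(len(rounds))  # 2
```

With a cap of two buckets per member per round, member "a" needs two
rounds to give away four buckets. A higher cap would finish in fewer
rounds at the cost of more concurrent transfers.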

Please let us know if you have feedback on this approach or additional
ideas that should be considered.
