helix-user mailing list archives

From Michael Craig <mcr...@box.com>
Subject Re: Correct way to redistribute work from disconnected instances?
Date Thu, 20 Oct 2016 03:55:23 GMT
Thanks for the quick response, Kishore. This issue is definitely tied to the
condition that partitions * replicas < NODE_COUNT.
If all running nodes have a "piece" of the resource, then they behave well
when the LEADER node goes away.
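
A minimal sketch of the arithmetic behind that condition, in plain Java rather than Helix code. The class and method names are hypothetical, and it assumes the resource in this thread has a single partition and that a node holds at most one replica of a given partition:

```java
// Hypothetical helper, not part of Helix. Assumes at most one replica of a
// given partition per node, so each replica "slot" occupies a distinct node.
final class ReplicaSlotCheck {
    // Total replica slots a single resource asks for.
    static int replicaSlots(int partitions, int replicas) {
        return partitions * replicas;
    }

    // Live nodes left without any piece of the resource once every slot is placed.
    static int idleNodes(int partitions, int replicas, int liveNodes) {
        return Math.max(0, liveNodes - replicaSlots(partitions, replicas));
    }

    public static void main(String[] args) {
        // Setup from this thread (partition count of 1 is an assumption):
        // 1 partition * 2 replicas = 2 slots, 3 live nodes.
        System.out.println(idleNodes(1, 2, 3)); // prints 1: one node holds nothing
        // After the leader drops, 2 live nodes remain for the same 2 slots.
        System.out.println(idleNodes(1, 2, 2)); // prints 0: no node should stay idle
    }
}
```

With three nodes, one node has no "piece" of the resource. Once the leader is gone the two remaining nodes exactly match the two slots, which is why the idle node would be expected to pick up the Standby replica.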

Is it possible to use Helix to manage a set of resources where that
condition is true? I.e. where the *total* number of partitions/replicas in
the cluster is greater than the node count, but each individual resource
has a small number of partitions/replicas.

(Calling rebalance on every liveInstance change does not seem like a good
solution, because you would have to iterate through all resources in the
cluster and rebalance each individually.)
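
For illustration, the per-resource loop described above might look roughly like the sketch below. `ClusterAdmin` is a hypothetical stand-in that mirrors two `HelixAdmin` methods (`getResourcesInCluster` and `rebalance(clusterName, resourceName, replica)`) so that the example compiles without the Helix jar. The `replicasByResource` map is also an assumption; a real controller would need to look up each resource's replica count itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in mirroring two HelixAdmin methods; not the Helix type.
interface ClusterAdmin {
    List<String> getResourcesInCluster(String clusterName);
    void rebalance(String clusterName, String resourceName, int replica);
}

final class RebalanceAll {
    // Rebalances every resource with a known replica count and returns the
    // names of the resources that were rebalanced, in processing order.
    static List<String> rebalanceAll(ClusterAdmin admin, String cluster,
                                     Map<String, Integer> replicasByResource) {
        List<String> done = new ArrayList<>();
        for (String resource : admin.getResourcesInCluster(cluster)) {
            Integer replicas = replicasByResource.get(resource);
            if (replicas == null) {
                continue; // unknown replica count: skip rather than guess
            }
            admin.rebalance(cluster, resource, replicas);
            done.add(resource);
        }
        return done;
    }

    public static void main(String[] args) {
        // In-memory fake that only logs the calls it receives.
        ClusterAdmin fake = new ClusterAdmin() {
            public List<String> getResourcesInCluster(String clusterName) {
                return Arrays.asList("resourceA", "resourceB");
            }
            public void rebalance(String clusterName, String resourceName, int replica) {
                System.out.println("rebalance " + clusterName + "/" + resourceName
                        + " replicas=" + replica);
            }
        };
        Map<String, Integer> replicas = new HashMap<>();
        replicas.put("resourceA", 2);
        replicas.put("resourceB", 2);
        rebalanceAll(fake, "demoCluster", replicas);
    }
}
```

The cost concern is visible in the loop: every live-instance change triggers one rebalance call per resource, so the work grows with the number of resources in the cluster.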

On Wed, Oct 19, 2016 at 12:52 PM, kishore g <g.kishore@gmail.com> wrote:

> I think this might be a corner case when partitions * replicas <
> TOTAL_NUMBER_OF_NODES. Can you try with many partitions and replicas and
> check if the issue still exists?
> On Wed, Oct 19, 2016 at 11:53 AM, Michael Craig <mcraig@box.com> wrote:
>> I've noticed that partitions/replicas assigned to disconnected instances
>> are not automatically redistributed to live instances. What's the correct
>> way to do this?
>> For example, given this setup with Helix 0.6.5:
>> - 1 resource
>> - 2 replicas
>> - LeaderStandby state model
>> - FULL_AUTO rebalance mode
>> - 3 nodes (N1 is Leader, N2 is Standby, N3 is just sitting)
>> Then drop N1:
>> - N2 becomes LEADER
>> - Nothing happens to N3
>> Naively, I would have expected N3 to transition from Offline to Standby,
>> but that doesn't happen.
>> I can force redistribution from GenericHelixController#onLiveInstanceChange
>> by
>> - dropping non-live instances from the cluster
>> - calling rebalance
>> The instance dropping seems pretty unsafe! Is there a better way?
