helix-user mailing list archives

From Lei Xia <xiax...@gmail.com>
Subject Re: Correct way to redistribute work from disconnected instances?
Date Thu, 20 Oct 2016 04:23:40 GMT
Hi, Michael

  Could you be more specific about the issue you are seeing? Specifically:
  1) For 1 resource and 2 replicas, you mean the resource has only 1
partition, with the replica number equal to 2, right?
  2) ... your idealState, right?
  3) By dropping N1, you mean disconnecting N1 from helix/zookeeper, so N1
is no longer in liveInstances, right?

  If your answers to all of the above questions are yes, then there may be
a bug here.  If possible, please paste your idealState and your test code
(if there is any) here, and I will try to reproduce and debug it.  Thanks
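For reference, an idealState for the setup you describe would be stored in zookeeper roughly like the record below (Helix 0.6.x field names; the resource and instance names are placeholders, and in FULL_AUTO mode the mapFields are computed and filled in by the controller, not hand-written):

```json
{
  "id": "myResource",
  "simpleFields": {
    "NUM_PARTITIONS": "1",
    "REPLICAS": "2",
    "REBALANCE_MODE": "FULL_AUTO",
    "STATE_MODEL_DEF_REF": "LeaderStandby"
  },
  "listFields": {},
  "mapFields": {
    "myResource_0": {
      "N1": "LEADER",
      "N2": "STANDBY"
    }
  }
}
```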


On Wed, Oct 19, 2016 at 9:02 PM, kishore g <g.kishore@gmail.com> wrote:

> Can you describe your scenario in detail and the expected behavior? I
> agree that calling rebalance on every live instance change is ugly and
> definitely not as per the design. It was an oversight (we focused a lot on
> the large-number-of-partitions case and failed to handle this simple one).
> Please file a jira and we will work on it. Lei, do you think the
> recent bug we fixed in AutoRebalancer will handle this case?
> thanks,
> Kishore G
> On Wed, Oct 19, 2016 at 8:55 PM, Michael Craig <mcraig@box.com> wrote:
>> Thanks for the quick response Kishore. This issue is definitely tied to
>> the condition that partitions * replicas < NODE_COUNT.
>> If all running nodes have a "piece" of the resource, then they behave
>> well when the LEADER node goes away.
>> Is it possible to use Helix to manage a set of resources where that
>> condition holds? I.e. where the *total* number of partitions/replicas
>> in the cluster is greater than the node count, but each individual resource
>> has a small number of partitions/replicas.
>> (Calling rebalance on every liveInstance change does not seem like a good
>> solution, because you would have to iterate through all resources in the
>> cluster and rebalance each one individually.)
>> On Wed, Oct 19, 2016 at 12:52 PM, kishore g <g.kishore@gmail.com> wrote:
>>> I think this might be a corner case when partitions * replicas <
>>> TOTAL_NUMBER_OF_NODES. Can you try with more partitions and replicas and
>>> check whether the issue still exists?
>>> On Wed, Oct 19, 2016 at 11:53 AM, Michael Craig <mcraig@box.com> wrote:
>>>> I've noticed that partitions/replicas assigned to disconnected
>>>> instances are not automatically redistributed to live instances. What's the
>>>> correct way to do this?
>>>> For example, given this setup with Helix 0.6.5:
>>>> - 1 resource
>>>> - 2 replicas
>>>> - LeaderStandby state model
>>>> - FULL_AUTO rebalance mode
>>>> - 3 nodes (N1 is Leader, N2 is Standby, N3 is sitting idle)
>>>> Then drop N1:
>>>> - N2 becomes LEADER
>>>> - Nothing happens to N3
>>>> Naively, I would have expected N3 to transition from Offline to
>>>> Standby, but that doesn't happen.
>>>> I can force redistribution from GenericHelixController#onLiveInstanceChange
>>>> by
>>>> - dropping non-live instances from the cluster
>>>> - calling rebalance
>>>> The instance dropping seems pretty unsafe! Is there a better way?
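To make the corner case concrete, here is a self-contained toy model of round-robin replica placement. This is not Helix's actual AutoRebalancer, just an illustration of the arithmetic: with 1 partition and 2 replicas over 3 live nodes, N3 is never assigned anything, and only a fresh placement over the shrunken live list would hand it a replica.

```java
import java.util.*;

public class RebalanceCornerCase {
    // Toy round-robin assignment: place `replicas` copies of each
    // partition on distinct live nodes, in node-list order.
    // NOT Helix's real rebalancer -- just an illustration.
    static Map<String, List<String>> assign(int partitions, int replicas,
                                            List<String> liveNodes) {
        Map<String, List<String>> map = new LinkedHashMap<>();
        int cursor = 0;
        for (int p = 0; p < partitions; p++) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add(liveNodes.get(cursor % liveNodes.size()));
                cursor++;
            }
            map.put("resource_" + p, holders);
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>(Arrays.asList("N1", "N2", "N3"));
        // 1 partition x 2 replicas over 3 nodes: N3 gets nothing.
        System.out.println("before: " + assign(1, 2, nodes));
        // Drop N1; a fresh placement over the remaining live nodes
        // gives N3 a replica -- the transition I expected to see.
        nodes.remove("N1");
        System.out.println("after:  " + assign(1, 2, nodes));
    }
}
```

Running this prints `resource_0=[N1, N2]` before the drop and `resource_0=[N2, N3]` after, which is the redistribution I was hoping the controller would perform automatically.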

Lei Xia
