helix-user mailing list archives

From Michael Craig <mcr...@box.com>
Subject Re: Correct way to redistribute work from disconnected instances?
Date Thu, 20 Oct 2016 06:52:55 GMT
Here is some repro code for the "drop a node, resource is not redistributed"
case I described:
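For orientation, a minimal sketch of that scenario (not the repro itself) might
look like the following, assuming a local ZooKeeper at localhost:2181 and that
the LeaderStandby participants and the controller are started separately; the
cluster, resource, and instance names are illustrative:

// Sketch only: sets up the cluster state the repro describes.
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState.RebalanceMode;
import org.apache.helix.model.InstanceConfig;

public class DropNodeSketch {
  public static void main(String[] args) {
    String zkAddr = "localhost:2181";
    String clusterName = "REPRO_CLUSTER";

    ZKHelixAdmin admin = new ZKHelixAdmin(zkAddr);
    admin.addCluster(clusterName, true);

    // Three participants; each would also run a HelixManager with a
    // LeaderStandby state model factory, and a controller runs separately.
    for (String node : new String[] {"localhost_12001", "localhost_12002",
        "localhost_12003"}) {
      admin.addInstance(clusterName, new InstanceConfig(node));
    }

    // One resource: 1 partition, 2 replicas, LeaderStandby, FULL_AUTO.
    admin.addResource(clusterName, "myResource", 1, "LeaderStandby",
        RebalanceMode.FULL_AUTO.toString());
    admin.rebalance(clusterName, "myResource", 2);

    // Once the participants connect: one node becomes LEADER, one STANDBY,
    // and the third stays OFFLINE. Disconnecting the LEADER promotes the
    // STANDBY, but the third node is never brought up to STANDBY unless
    // rebalance() is called again.
  }
}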

Can we answer these 2 questions? That would help clarify things:

   - Should you have to `rebalance` a resource when adding a new node to
   the cluster?
      - If no, this is an easy bug to reproduce. The example code calls
      rebalance after adding a node, and it breaks if you comment out that
      call.
      - If yes, what is the correct way to manage many resources on a
      cluster? Iterate through all resources and rebalance them for every
      new node? (See the sketch below.)
   - Should you have to `rebalance` when a node is dropped?
      - If no, there is a bug. See the repro code posted above.
      - If yes, we are in the same rebalance-every-resource situation as
      above.
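If the answer to both turns out to be yes, the fallback would be something like
the sketch below (assuming an already-constructed HelixAdmin, and that the
replica count stored in each ideal state is a plain number rather than
ANY_LIVEINSTANCE):

// Sketch of the "rebalance every resource on every liveInstance change"
// fallback; names are illustrative.
import java.util.List;

import org.apache.helix.HelixAdmin;
import org.apache.helix.model.IdealState;

public class RebalanceAll {
  static void rebalanceAllResources(HelixAdmin admin, String clusterName) {
    List<String> resources = admin.getResourcesInCluster(clusterName);
    for (String resource : resources) {
      IdealState idealState = admin.getResourceIdealState(clusterName, resource);
      // getReplicas() can also hold "ANY_LIVEINSTANCE"; a real implementation
      // would handle that instead of parsing it as an int.
      int replicas = Integer.parseInt(idealState.getReplicas());
      admin.rebalance(clusterName, resource, replicas);
    }
  }
}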

My use case is to manage a set of ad-hoc tasks across a cluster of
machines. Each task would be a separate resource with a unique name, with 1
partition and 1 replica. Each resource would reside on exactly 1 node, and
there is no limit on the number of resources per node.
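In code, registering one of these task resources would look roughly like this
sketch (assuming a connected HelixAdmin and the same LeaderStandby/FULL_AUTO
setup as in the repro; the task name is illustrative):

// Sketch: one single-partition, single-replica resource per ad-hoc task.
import org.apache.helix.HelixAdmin;
import org.apache.helix.model.IdealState.RebalanceMode;

public class AdHocTasks {
  static void addTask(HelixAdmin admin, String clusterName, String taskName) {
    // Each task is its own resource: 1 partition, 1 replica, FULL_AUTO,
    // so Helix should place it on exactly one live node.
    admin.addResource(clusterName, taskName, 1, "LeaderStandby",
        RebalanceMode.FULL_AUTO.toString());
    admin.rebalance(clusterName, taskName, 1);
  }
}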

On Wed, Oct 19, 2016 at 9:23 PM, Lei Xia <xiaxlei@gmail.com> wrote:

> Hi, Michael
>   Could you be more specific on the issue you see? Specifically:
>   1) For 1 resource and 2 replicas, you mean the resource has only 1
> partition, with the replica number equal to 2, right?
>   2) By FULL_AUTO rebalance mode, you mean the rebalance mode is FULL_AUTO in
> your idealState, right?
>   3) By dropping N1, you mean disconnecting N1 from helix/zookeeper, so N1 is
> not in liveInstances, right?
>   If your answers to all of the above questions are yes, then there may be
> some bug here.  If possible, please paste your idealState and your test
> code (if there is any) here, and I will try to reproduce and debug it.  Thanks
> Lei
> On Wed, Oct 19, 2016 at 9:02 PM, kishore g <g.kishore@gmail.com> wrote:
>> Can you describe your scenario in detail and the expected behavior? I
>> agree calling rebalance on every live instance change is ugly and
>> definitely not as per the design. It was an oversight (we focused a lot on
>> large numbers of partitions and failed to handle this simple case).
>> Please file a jira and we will work on that. Lei, do you think the
>> recent bug we fixed with AutoRebalancer will handle this case?
>> thanks,
>> Kishore G
>> On Wed, Oct 19, 2016 at 8:55 PM, Michael Craig <mcraig@box.com> wrote:
>>> Thanks for the quick response, Kishore. This issue is definitely tied to
>>> the condition that partitions * replicas < NODE_COUNT.
>>> If all running nodes have a "piece" of the resource, then they behave
>>> well when the LEADER node goes away.
>>> Is it possible to use Helix to manage a set of resources where that
>>> condition is true? I.e. where the *total* number of partitions/replicas
>>> in the cluster is greater than the node count, but each individual resource
>>> has a small number of partitions/replicas.
>>> (Calling rebalance on every liveInstance change does not seem like a
>>> good solution, because you would have to iterate through all resources in
>>> the cluster and rebalance each individually.)
>>> On Wed, Oct 19, 2016 at 12:52 PM, kishore g <g.kishore@gmail.com> wrote:
>>>> I think this might be a corner case when partitions * replicas <
>>>> TOTAL_NUMBER_OF_NODES. Can you try with many partitions and replicas and
>>>> check if the issue still exists.
>>>> On Wed, Oct 19, 2016 at 11:53 AM, Michael Craig <mcraig@box.com> wrote:
>>>>> I've noticed that partitions/replicas assigned to disconnected
>>>>> instances are not automatically redistributed to live instances. What's
>>>>> the correct way to do this?
>>>>> For example, given this setup with Helix 0.6.5:
>>>>> - 1 resource
>>>>> - 2 replicas
>>>>> - LeaderStandby state model
>>>>> - FULL_AUTO rebalance mode
>>>>> - 3 nodes (N1 is Leader, N2 is Standby, N3 is just sitting idle)
>>>>> Then drop N1:
>>>>> - N2 becomes LEADER
>>>>> - Nothing happens to N3
>>>>> Naively, I would have expected N3 to transition from Offline to
>>>>> Standby, but that doesn't happen.
>>>>> I can force redistribution from GenericHelixController#onLiveInstanceChange
>>>>> by
>>>>> - dropping non-live instances from the cluster
>>>>> - calling rebalance
>>>>> The instance dropping seems pretty unsafe! Is there a better way?
> --
> Lei Xia
