helix-user mailing list archives

From kishore g <g.kish...@gmail.com>
Subject Re: Correct way to redistribute work from disconnected instances?
Date Thu, 20 Oct 2016 04:02:55 GMT
Can you describe your scenario in detail and the expected behavior? I
agree calling rebalance on every live instance change is ugly and
definitely not as per the design. It was an oversight (we focused a lot on
large numbers of partitions and failed to handle this simple case).

Please file a JIRA and we will work on that. Lei, do you think the recent
bug we fixed with AutoRebalancer will handle this case?

Kishore G

On Wed, Oct 19, 2016 at 8:55 PM, Michael Craig <mcraig@box.com> wrote:

> Thanks for the quick response Kishore. This issue is definitely tied to
> the condition that partitions * replicas < NODE_COUNT.
> If all running nodes have a "piece" of the resource, then they behave well
> when the LEADER node goes away.
> Is it possible to use Helix to manage a set of resources where that
> condition is true? I.e. where the *total* number of partitions/replicas
> in the cluster is greater than the node count, but each individual resource
> has a small number of partitions/replicas.
> (Calling rebalance on every liveInstance change does not seem like a good
> solution, because you would have to iterate through all resources in the
> cluster and rebalance each individually.)
> On Wed, Oct 19, 2016 at 12:52 PM, kishore g <g.kishore@gmail.com> wrote:
>> I think this might be a corner case when partitions * replicas <
>> TOTAL_NUMBER_OF_NODES. Can you try with many partitions and replicas and
>> check if the issue still exists?
>> On Wed, Oct 19, 2016 at 11:53 AM, Michael Craig <mcraig@box.com> wrote:
>>> I've noticed that partitions/replicas assigned to disconnected instances
>>> are not automatically redistributed to live instances. What's the correct
>>> way to do this?
>>> For example, given this setup with Helix 0.6.5:
>>> - 1 resource
>>> - 2 replicas
>>> - LeaderStandby state model
>>> - FULL_AUTO rebalance mode
>>> - 3 nodes (N1 is Leader, N2 is Standby, N3 is just sitting)
>>> Then drop N1:
>>> - N2 becomes LEADER
>>> - Nothing happens to N3
>>> Naively, I would have expected N3 to transition from Offline to Standby,
>>> but that doesn't happen.
>>> I can force redistribution from GenericHelixController#onLiveInstanceChange
>>> by
>>> - dropping non-live instances from the cluster
>>> - calling rebalance
>>> The instance dropping seems pretty unsafe! Is there a better way?
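
For readers of the archive, the behavior expected in the example above can be sketched as a toy model. This is only an illustration under stated assumptions: it does not use the Helix API and is not Helix's rebalancing algorithm. The class name ReplicaPlacementSketch and the helper assign are hypothetical. The sketch recomputes the placement of one partition with 2 replicas over whichever nodes are currently live, which is the outcome the original poster expected after N1 is dropped (N2 becomes LEADER and N3 picks up STANDBY).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT the Helix API and NOT Helix's rebalancing algorithm.
public class ReplicaPlacementSketch {

    // Hypothetical helper: place one partition's replicas on the live nodes,
    // in list order. The first node becomes LEADER, the next (replicas - 1)
    // nodes become STANDBY, and any remaining live nodes hold nothing.
    static Map<String, String> assign(List<String> liveNodes, int replicas) {
        Map<String, String> placement = new LinkedHashMap<>();
        int count = Math.min(replicas, liveNodes.size());
        for (int i = 0; i < count; i++) {
            placement.put(liveNodes.get(i), i == 0 ? "LEADER" : "STANDBY");
        }
        return placement;
    }

    public static void main(String[] args) {
        // Setup from the thread: 3 nodes, 1 partition, 2 replicas.
        List<String> live = new ArrayList<>(Arrays.asList("N1", "N2", "N3"));
        System.out.println("before drop: " + assign(live, 2));

        // N1 disconnects; recompute the placement over the remaining nodes.
        live.remove("N1");
        System.out.println("after drop:  " + assign(live, 2));
    }
}
```

With 3 live nodes and partitions * replicas = 2, one node necessarily holds nothing, which is the corner case discussed in the replies. Recomputing over the live set alone is enough to bring the idle node in, without dropping any instance from the cluster configuration.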
