helix-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zhen Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HELIX-541) Possible "livelock" in Helix controller
Date Mon, 24 Nov 2014 23:01:13 GMT

    [ https://issues.apache.org/jira/browse/HELIX-541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223719#comment-14223719

Zhen Zhang commented on HELIX-541:

Here is the problem.
- At time t1: Node_0 is selected LEADER for partition_0, and we have a pending message, say
STANDBY->LEADER for (partition_0, Node_0).

- At time t2 (>t1): Node_0 finishes STANDBY->LEADER transition but hasn't deleted the
message. Meantime, controller selects another LEADER for partition_0, say Node_1. Since Node_0's
current state is LEADER and we are only considering toState in pending message, controller
thinks it's OK to send OFFLINE->STANDBY t0 Node_1.

- At time t3 (>t2): Node_0 in LEADER and Node_1 in STANDBY, controller now enters into
the livelock where neither can it send STANDBY->LEADERl to Node_1, nor it can send LEADER->STANDBY
to Node_0.

The fix it to consider both fromState and toState in pending message, so controller will not
send OFFLINE->STANDBY to Node_1 if Node_0 hasn't remove STANDBY->LEADER message. For
the next Helix pipeline, since we have transition priority that LEADER->STANDBY > OFFLINE->STANDBY,
controller will bring Node_0 to STANDBY then to OFFLINE first, avoiding the livelock.

> Possible "livelock" in Helix controller
> ---------------------------------------
>                 Key: HELIX-541
>                 URL: https://issues.apache.org/jira/browse/HELIX-541
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Zhen Zhang
> We discover a "livelock" bug in Helix controller. This leads to HELIX-540. Assuming we
have 3 partitions and 2 nodes, using LeaderStandby state model, FULL_AUTO mode, and replica
is 2. When reaching stable mapping, we might have the following:
> {noformat}
> partition_0:
>   node_0: LEADER
>   node_1: STANDBY
> partition_1:
>   node_1: LEADER
>   node_0: STANDBY
> partition_2:
>   node_0: LEADER
>   node_1: STANDBY
> {noformat}
> Later we add a new node (node_2) to the cluster and rebalancer decides that node_2 should
become LEADER for partition_2. So controller first sends OFFLINE->STANDBY transition to
node_2, and the mapping becomes:
> {noformat}
> partition_0:
>   node_0: LEADER
>   node_1: STANDBY
> partition_1:
>   node_1: LEADER
>   node_0: STANDBY
> partition_2:
>   node_0: LEADER
>   node_1: STANDBY
>   node_2: STANDBY
> {noformat}
> Note that given LEADER state count is 1 and STANDBY state count is R, where R=2, it is
implying the following state constraints:
> {noformat}
> LEADER: upper_bound=1
> STANDBY: upper_bound=2
> {noformat}
> Helix controller now enters the "livelock": it can't send STANDBY->LEADER to node_2,
since this will violate LEADER upper bound; it can't send LEADER->STANDBY to node_0 either,
since will violate STANDBY upper bound.
> We can solve the problem in several ways:
> 1) State count definition is ambiguous. For some state, like LEADER, when we say state_count=1,
that means we can't violate this constraint at any time. However, for some other state, like
STANDBY, when we say state_count=R, that means in stable mapping, there should be R-1 STANDBY
replicas, but we don't care the count in any transient state. In this case, we can set STANDBY
upper_bound to be larger than 2. Note that this doesn't solve the problem in general. We may
have some state model that has a restrict requirement on STANDBY state.
> 2) Using state transition priority. If we define LEADER->STANDBY transition should
have a higher priority than OFFLINE->STANDBY transition, it will solve the "livelock".
But this doesn't solve the problem in general either, because when the state model gets complicated,
it's hard to define and prove proper transition priorities that avoid "livelock" in any situation.
> 3) The root cause of the problem is that Helix controller uses a greedy algorithm that
only looks one step ahead. In the example, if Helix controller can look two steps further,
then it will find out that sending OFFLINE->STANDBY transition to node_2 will lead to a
dead end, therefore it should choose to send LEADER->STANDBY to node_0 instead. In general
we might need to do a DFS/BFS and it's hard if state model is complicated and system is large.
> In practice, most systems use simple state models with less than 5 states and have strict
state constraint on a single state (e.g MASTER, LEADER) only.  We can avoid "livelock" by
carefully choosing state constraints.

This message was sent by Atlassian JIRA

View raw message