helix-commits mailing list archives

From "Vinayak Borkar (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HELIX-595) Possible deadlock in state transition sequence
Date Tue, 05 May 2015 01:14:06 GMT
Vinayak Borkar created HELIX-595:

             Summary: Possible deadlock in state transition sequence
                 Key: HELIX-595
                 URL: https://issues.apache.org/jira/browse/HELIX-595
             Project: Apache Helix
          Issue Type: Bug
            Reporter: Vinayak Borkar

In my setup I have a resource that has about 160 partitions. The resource uses the MasterSlave
state model. The partitions have been configured to have just 1 replica. For some partitions
(about 5), I am observing that there are two replicas, one in MASTER mode and one in SLAVE
mode. In addition, I am observing an imbalance with respect to the MASTER replica placement
on the machines I have.

In discussions with Kishore, the conclusion was that a deadlock occurs as Helix makes state
transitions to correct the imbalance: it reaches a state where any further transition would
violate the constraints of the state model.

The MasterSlave state model allows at most one MASTER and at most R SLAVES (in my case R =

Say the current MASTER of a partition is on hostA, but Helix wants to move it to hostB. Helix
would run the following transitions:

hostA: t1(M -> S), t2(S -> O)
hostB: t3(O -> S), t4(S -> M)

If t1 and t2 happen before t3, then Helix eventually achieves the correct placement of the
master on hostB. However, if t3 runs first, then hostB will have a SLAVE of the partition
while hostA still holds mastership. Once this happens, every transition that remains to be
performed violates a state model constraint, so we end up stuck with a MASTER on hostA and a
SLAVE on hostB for this partition.
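
The ordering problem described above can be reproduced with a small toy model. This is only a
sketch and is not Helix code: the names `fire` and `enabled` are hypothetical, and the limits
of one MASTER and one SLAVE per partition are assumptions made for illustration (the value of
R is not stated here).

```python
# Toy model of the four transitions from the report (not Helix code).
# Each entry: transition name -> (host, from_state, to_state).
TRANSITIONS = {
    "t1": ("hostA", "M", "S"),
    "t2": ("hostA", "S", "O"),
    "t3": ("hostB", "O", "S"),
    "t4": ("hostB", "S", "M"),
}


def fire(states, name, max_masters=1, max_slaves=1):
    """Return the state map after firing `name`, or None if it cannot fire.

    max_masters and max_slaves are assumed upper bounds for illustration.
    """
    host, src, dst = TRANSITIONS[name]
    if states[host] != src:
        return None  # the replica is not in the transition's source state
    nxt = {**states, host: dst}
    values = list(nxt.values())
    if values.count("M") > max_masters or values.count("S") > max_slaves:
        return None  # firing would violate a state-count constraint
    return nxt


def enabled(states, pending):
    """List the pending transitions that could legally fire right now."""
    return [t for t in pending if fire(states, t) is not None]


start = {"hostA": "M", "hostB": "O"}

# Order 1: t1 and t2 complete before t3, so the master moves to hostB.
good = start
for t in ["t1", "t2", "t3", "t4"]:
    good = fire(good, t)

# Order 2: t3 runs first, after which none of t1, t2, t4 can fire.
after_t3 = fire(start, "t3")
stuck = enabled(after_t3, ["t1", "t2", "t4"])

print(good)      # final placement for order 1
print(after_t3)  # placement after t3 alone
print(stuck)     # transitions still able to fire in order 2
```

Under these assumed limits, once t3 has fired, t1 would create a second SLAVE, t4 would create
a second MASTER, and t2 is not applicable because hostA is still MASTER, so nothing can proceed.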

You can find the ZK logs corresponding to the MESSAGES for such a partition here: http://pastebin.com/zqqSk4MA

Please let me know what other details would be necessary to get to the bottom of this issue.
