curator-dev mailing list archives

From "Stephen Ingram (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CURATOR-205) Repeated InterruptedExceptions during mutex acquire leads to LeaderSelector deadlock
Date Wed, 08 Apr 2015 18:35:12 GMT

     [ https://issues.apache.org/jira/browse/CURATOR-205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephen Ingram updated CURATOR-205:
-----------------------------------
    Description: 
When an InterruptedException is thrown in internalLockLoop, which is called during mutex.acquire,
internalLockLoop sets a "doDelete" flag. That flag tells the finally clause to delete
the lock path that we were trying to create.

However, in the pathInForeground function of DeleteBuilderImpl, a _second_ InterruptedException
may occur before ZooKeeper can delete the specified path.  The RetryLoop machinery in
that function retries only on retryable exceptions, a class which does not include
InterruptedException.

The second InterruptedException then causes pathInForeground to exit without deleting
the path, leading to a deadlock in which no one can acquire the mutex.
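
The failure mode described above can be sketched in plain Java. This is a simplified
model under stated assumptions, not Curator's code: LeakedLockNodeSketch, RetryableException,
callWithRetry and simulateAcquire are hypothetical names, an in-memory set stands in for
ZooKeeper, and the interrupts are simulated by throwing InterruptedException directly.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;

// Simplified model of the failure; every name here is hypothetical, not Curator code.
final class LeakedLockNodeSketch {
    /** Stand-in for the failures the retry loop is willing to retry. */
    static final class RetryableException extends Exception {}

    /** Retries only RetryableException; any other exception escapes immediately. */
    static <T> T callWithRetry(Callable<T> op, int maxRetries) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (RetryableException e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
            }
        }
    }

    /**
     * Creates a lock node, is interrupted while waiting for the lock, and then
     * tries to clean up. Returns the set of nodes that still exist afterwards.
     */
    static Set<String> simulateAcquire(String path, boolean interruptDuringDelete) {
        Set<String> nodes = new HashSet<>();
        boolean doDelete = false;
        try {
            nodes.add(path); // the lock node is created
            throw new InterruptedException("first interrupt, while waiting for the lock");
        } catch (InterruptedException first) {
            doDelete = true; // ask the finally clause to clean up
        } finally {
            if (doDelete) {
                Callable<Boolean> delete = () -> {
                    if (interruptDuringDelete) {
                        throw new InterruptedException("second interrupt, before the delete");
                    }
                    return nodes.remove(path);
                };
                try {
                    callWithRetry(delete, 3);
                } catch (Exception second) {
                    // Not retryable in this model, so the delete is abandoned here.
                }
            }
        }
        return nodes;
    }
}
```

In this model, simulateAcquire(path, true) returns a set that still contains the lock
node, which corresponds to the leaked path that blocks later acquirers, while
simulateAcquire(path, false) returns an empty set.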

In my test, I am certain that both of these InterruptedExceptions are due to repeated fluctuation
in the ConnectionStateManager's connection state.  Once the state stops fluctuating, no
leader can be selected because the node we failed to delete persists.

I was able to address this bug with a solution similar to CURATOR-45:  if pathInForeground
is interrupted with an InterruptedException, I schedule a BackgroundCallback to attempt
pathInForeground again.  Once the connection is stable, this task deletes the path
and the new leader can acquire the mutex.
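
The shape of that retry can be sketched as follows. This is an illustration of the idea
only, not the actual patch: BackgroundDeleteSketch and its members are hypothetical, a
plain queue stands in for the scheduled BackgroundCallback, and a boolean flag stands in
for the connection state.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical model of "retry the delete later"; not Curator's implementation.
final class BackgroundDeleteSketch {
    final Set<String> nodes = new HashSet<>();               // stand-in for ZooKeeper nodes
    final Queue<String> pendingDeletes = new ArrayDeque<>(); // deletes waiting for a retry
    boolean connectionStable = false;                        // stand-in for the connection state

    /** Foreground delete: interrupted whenever the connection is still fluctuating. */
    void deleteInForeground(String path) throws InterruptedException {
        if (!connectionStable) {
            throw new InterruptedException("interrupted before delete of " + path);
        }
        nodes.remove(path);
    }

    /** Tries the foreground delete; on interruption, queues the path for a later retry. */
    void deleteWithFallback(String path) {
        try {
            deleteInForeground(path);
        } catch (InterruptedException e) {
            pendingDeletes.add(path);
        }
    }

    /** Retries each queued delete once; paths that fail again are re-queued. */
    void runPendingDeletes() {
        int queued = pendingDeletes.size();
        for (int i = 0; i < queued; i++) {
            deleteWithFallback(pendingDeletes.remove());
        }
    }
}
```

While connectionStable is false, deleteWithFallback leaves the node in place and records
the path in pendingDeletes. After the flag is set to true, a call to runPendingDeletes
removes the node and empties the queue, so the path no longer blocks the next acquirer.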

I have a repro and a fix if needed.


> Repeated InterruptedExceptions during mutex acquire leads to LeaderSelector deadlock
> ------------------------------------------------------------------------------------
>
>                 Key: CURATOR-205
>                 URL: https://issues.apache.org/jira/browse/CURATOR-205
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.7.2
>            Reporter: Stephen Ingram



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
