curator-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CURATOR-358) Receiving KeeperException with NoNode when LeaderLatch#getLeader()
Date Sun, 20 Nov 2016 23:43:58 GMT

    [ https://issues.apache.org/jira/browse/CURATOR-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15682042#comment-15682042 ]

ASF GitHub Bot commented on CURATOR-358:
----------------------------------------

GitHub user cammckenzie opened a pull request:

    https://github.com/apache/curator/pull/173

    CURATOR-358 - Fixed race condition with getLeader()

    - If leadership changes between the getParticipantNodes() call and the internal getLeader() call, the NoNodeException is now handled and the next child in the list is evaluated.
    
    Another option would be to just return the default empty Participant object and not iterate over the whole list of participants.
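
The skip-and-continue behaviour described above can be sketched roughly as follows. This is not the code from the pull request and it does not use the Curator or ZooKeeper APIs: FakeStore, NoNodeException and LeaderLookup are hypothetical stand-ins, with an in-memory map playing the role of the ZooKeeper tree, so that the race can be reproduced without a server.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for KeeperException.NoNodeException (not the ZooKeeper class).
class NoNodeException extends Exception {
    NoNodeException(String path) {
        super("NoNode for " + path);
    }
}

// Hypothetical in-memory substitute for the ZooKeeper data tree.
class FakeStore {
    private final Map<String, String> nodes = new ConcurrentHashMap<>();

    void put(String path, String participantId) {
        nodes.put(path, participantId);
    }

    void delete(String path) {
        nodes.remove(path);
    }

    // Mirrors the failure mode in the report: reading a deleted node throws.
    String getData(String path) throws NoNodeException {
        String participantId = nodes.get(path);
        if (participantId == null) {
            throw new NoNodeException(path);
        }
        return participantId;
    }
}

class LeaderLookup {
    // Walks a previously listed, sorted snapshot of participant paths and returns
    // the id stored at the first path that still exists. Returns Optional.empty()
    // if every listed node has disappeared in the meantime.
    static Optional<String> getLeader(FakeStore store, List<String> sortedPaths) {
        for (String path : sortedPaths) {
            try {
                return Optional.of(store.getData(path));
            } catch (NoNodeException e) {
                // The node vanished after the listing was taken; try the next child.
            }
        }
        return Optional.empty();
    }
}
```

With a snapshot of two paths, deleting the first one before calling LeaderLookup.getLeader() yields the second participant's id rather than an exception. The alternative mentioned above would correspond to returning the empty result from the catch block instead of continuing the loop.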

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/curator CURATOR-358

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/curator/pull/173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #173
    
----
commit 3478aca7ed6852484b5574a6082f4bb75c04a1e0
Author: Cam McKenzie <cammckenzie@apache.org>
Date:   2016-11-20T23:38:15Z

    CURATOR-358 - Fixed race condition with getLeader()
    -If leadership changes between the getParticipantNodes() call and the getLeader() internal call the NoNodeException is now handled and the next child in the list is evaluated.

----


> Receiving KeeperException with NoNode when LeaderLatch#getLeader()
> ------------------------------------------------------------------
>
>                 Key: CURATOR-358
>                 URL: https://issues.apache.org/jira/browse/CURATOR-358
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.10.0
>            Reporter: Satish Duggana
>            Priority: Critical
>
> org.apache.curator.framework.recipes.leader.LeaderLatch#getLeader() intermittently throws a KeeperException with Code#NONODE, as shown in the stack trace below. The participant's ephemeral ZK node may have been removed because its connection/session was closed.
> The code in question is at https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L451
> public Participant getLeader() throws Exception
> {
>     Collection<String> participantNodes = LockInternals.getParticipantNodes(client, latchPath, LOCK_NAME, sorter);
>     return LeaderSelector.getLeader(client, participantNodes);
> }
> I guess it hits a race condition: a participant node is retrieved, but by the time LeaderSelector#getLeader() is invoked the node has been removed because of a session timeout, so a KeeperException with the NoNode code is thrown. The call is not retried, because the RetryLoop retries only on connection/session timeouts. In this case, though, NoNode should have been retried. I could not find any API on CuratorClient to configure which KeeperException codes are retried. It may be good to have a way to specify which errors should be retried in the org.apache.curator.framework.CuratorFrameworkFactory.Builder APIs.
> Intermittent Exception found with the stack trace:
> 2016-11-15 06:09:33.954 o.a.s.d.nimbus [ERROR] Error when processing event
> org.apache.storm.shade.org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /storm/leader-lock/_c_97c09eed-5bba-4ac8-a05f-abdc4e8e95cf-latch-0000000002
>      at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>      at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>      at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:304)
>      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:293)
>      at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108)
>      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:290)
>      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:281)
>      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:42)
>      at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:375)
>      at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:346)
>      at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderLatch.getLeader(LeaderLatch.java:454)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
