hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chaoyu Tang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-16487) Serious Zookeeper exception is logged when a race condition happens
Date Tue, 25 Apr 2017 16:12:04 GMT

    [ https://issues.apache.org/jira/browse/HIVE-16487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983128#comment-15983128
] 

Chaoyu Tang commented on HIVE-16487:
------------------------------------

[~pvary] I think the Exception e1 will be eaten, if it is not an instance of KeeperException,
after numRetriesForLock with this patch.

> Serious Zookeeper exception is logged when a race condition happens
> -------------------------------------------------------------------
>
>                 Key: HIVE-16487
>                 URL: https://issues.apache.org/jira/browse/HIVE-16487
>             Project: Hive
>          Issue Type: Bug
>          Components: Locking
>    Affects Versions: 3.0.0
>            Reporter: Peter Vary
>            Assignee: Peter Vary
>         Attachments: HIVE-16487.patch
>
>
> A customer started to see this in the logs, but happily everything was working as intended:
> {code}
> 2017-03-30 12:01:59,446 ERROR ZooKeeperHiveLockManager: [HiveServer2-Background-Pool:
Thread-620]: Serious Zookeeper exception: 
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hive_zookeeper_namespace/<TABLE_NAME>/LOCK-SHARED-
> {code}
> This was happening, because a race condition between the lock releasing, and lock acquiring.
The thread releasing the lock removes the parent ZK node just after the thread acquiring the
lock made sure, that the parent node exists.
> Since this can happen without any real problem, I plan to add NODEEXISTS, and NONODE
as a transient ZooKeeper exception, so the users are not confused.
> Also, the original author of ZooKeeperHiveLockManager maybe planned to handle different
ZooKeeperExceptions differently, and the code is hard to understand. See the {{continue}}
and the {{break}}. The {{break}} only breaks the switch, and not the loop which IMHO is not
intuitive:
> {code}
>     do {
>       try {
> [..]
>         ret = lockPrimitive(key, mode, keepAlive, parentCreated, 
>       } catch (Exception e1) {
>         if (e1 instanceof KeeperException) {
>           KeeperException e = (KeeperException) e1;
>           switch (e.code()) {
>           case CONNECTIONLOSS:
>           case OPERATIONTIMEOUT:
>             LOG.debug("Possibly transient ZooKeeper exception: ", e);
>             continue;
>           default:
>             LOG.error("Serious Zookeeper exception: ", e);
>             break;
>           }
>         }
> [..]
>       }
>     } while (tryNum < numRetriesForLock);
> {code}
> If we do not want to try again in case of a "Serious Zookeeper exception:", then we should
add a label to the do loop, and break it in the switch.
> If we do want to try regardless of the type of the ZK exception, then we should just
change the {{continue;}} to {{break;}} and move the lines part of the code which did not run
in case of {{continue}} to the {{default}} switch, so it is easier to understand the code.
> Any suggestions or ideas [~ctang.ma] or [~szehon]?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message