From issues-return-153215-archive-asf-public=cust-asf.ponee.io@hive.apache.org Tue Mar 19 02:10:04 2019 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id E8DAC180651 for ; Tue, 19 Mar 2019 03:10:03 +0100 (CET) Received: (qmail 25429 invoked by uid 500); 19 Mar 2019 02:10:03 -0000 Mailing-List: contact issues-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list issues@hive.apache.org Received: (qmail 25405 invoked by uid 99); 19 Mar 2019 02:10:03 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 19 Mar 2019 02:10:03 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 96D5FC183D for ; Tue, 19 Mar 2019 02:10:02 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -110.301 X-Spam-Level: X-Spam-Status: No, score=-110.301 tagged_above=-999 required=6.31 tests=[ENV_AND_HDR_SPF_MATCH=-0.5, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001, USER_IN_DEF_SPF_WL=-7.5, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id hP5JqtjE84K2 for ; Tue, 19 Mar 2019 02:10:01 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id B637E5F57A for ; Tue, 19 Mar 2019 02:10:00 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 4F6B8E00A7 for ; Tue, 19 Mar 2019 02:10:00 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 12F6424597 for ; Tue, 19 Mar 2019 02:10:00 +0000 (UTC) Date: Tue, 19 Mar 2019 02:10:00 +0000 (UTC) From: "David Mollitor (JIRA)" To: issues@hive.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated HIVE-21469: ---------------------------------- Description: A lot of sins in this class to resolve: {code:java} @Override public void setContext(HiveLockManagerCtx ctx) throws LockException { try { curatorFramework = CuratorFrameworkSingleton.getInstance(conf); parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE); try{ curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" + parent, new byte[0]); } catch (Exception e) { // ignore if the parent already exists if (!(e instanceof KeeperException) || ((KeeperException)e).code() != KeeperException.Code.NODEEXISTS) { LOG.warn("Unexpected ZK exception when creating parent node /" + parent, e); } } {code} Every time a new session is created and this {{setContext}} method is called, it attempts to create the root node. I have seen that, even though the root node exists, an create node action is written to the ZK logs. Check first if the node exists before trying to create it. {code:java} try { curatorFramework.delete().forPath(zLock.getPath()); } catch (InterruptedException ie) { curatorFramework.delete().forPath(zLock.getPath()); } {code} There has historically been a quite a few bugs regarding leaked locks. The Driver will signal the session {{Thread}} by performing an interrupt. That interrupt can happen any time and it can kill a create/delete action within the ZK framework. We can see one example of workaround for this. If the ZK action is interrupted, simply do it again. Well, what if it's interrupted yet again? The lock will be leaked. Also, when the {{InterruptedException}} is caught in the try block, the thread's interrupted flag is cleared. The flag is not reset in this code and therefore we lose the fact that this thread has been interrupted. This flag should be preserved so that other code paths will know that it's time to exit. {code:java} if (tryNum > 1) { Thread.sleep(sleepTime); } unlockPrimitive(hiveLock, parent, curatorFramework); break; } catch (Exception e) { if (tryNum >= numRetriesForUnLock) { String name = ((ZooKeeperHiveLock)hiveLock).getPath(); throw new LockException("Node " + name + " can not be deleted after " + numRetriesForUnLock + " attempts.", e); } } {code} ... related... the sleep here may be interrupted, but we still need to delete the lock (again, for fear of leaking it). This sleep should be uninterruptible. If we need to get the lock deleted, and there's a problem, interrupting the sleep will cause the code to eventually exit and locks will be leaked. It also requires a bunch more TLC. was: A lot of sins in this class to resolve: {code:java} @Override public void setContext(HiveLockManagerCtx ctx) throws LockException { try { curatorFramework = CuratorFrameworkSingleton.getInstance(conf); parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE); try{ curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" + parent, new byte[0]); } catch (Exception e) { // ignore if the parent already exists if (!(e instanceof KeeperException) || ((KeeperException)e).code() != KeeperException.Code.NODEEXISTS) { LOG.warn("Unexpected ZK exception when creating parent node /" + parent, e); } } {code} Every time a new session is created and this {{setContext}} method is called, it attempts to create the root node. I have seen that, even though the root node exists, an create node action is written to the ZK logs. Check first if the node exists before trying to create it. {code:java} try { curatorFramework.delete().forPath(zLock.getPath()); } catch (InterruptedException ie) { curatorFramework.delete().forPath(zLock.getPath()); } {code} There has historically been a quite a few bugs regarding leaked locks. The Driver will signal the session {{Thread}} by performing an interrupt. That interrupt can happen any time and it can kill a create/delete action within the ZK framework. We can see one example of workaround for this. If the ZK action is interrupted, simply do it again. Well, what if it's interrupted yet again? The lock will be leaked anyway. Also, when the {{InterruptedException}} is caught in the try block, the thread's interrupted flag is cleared. The flag is not reset in this code and therefore we lose the fact that this thread has been interrupted. {code:java} if (tryNum > 1) { Thread.sleep(sleepTime); } unlockPrimitive(hiveLock, parent, curatorFramework); break; } catch (Exception e) { if (tryNum >= numRetriesForUnLock) { String name = ((ZooKeeperHiveLock)hiveLock).getPath(); throw new LockException("Node " + name + " can not be deleted after " + numRetriesForUnLock + " attempts.", e); } } {code} ... related... the sleep here may be interrupted, but we still need to delete the lock (again, for fear of leaking it). This sleep should be uninterruptible. If we need to get the lock deleted, and there's a problem, interrupting the sleep will cause the code to eventually exit and locks will be leaked. It also requires a bunch more TLC. > Review of ZooKeeperHiveLockManager > ---------------------------------- > > Key: HIVE-21469 > URL: https://issues.apache.org/jira/browse/HIVE-21469 > Project: Hive > Issue Type: Improvement > Components: Locking > Affects Versions: 4.0.0, 3.2.0 > Reporter: David Mollitor > Assignee: David Mollitor > Priority: Major > Attachments: HIVE-21469.1.patch > > > A lot of sins in this class to resolve: > {code:java} > @Override > public void setContext(HiveLockManagerCtx ctx) throws LockException { > try { > curatorFramework = CuratorFrameworkSingleton.getInstance(conf); > parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE); > try{ > curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" + parent, new byte[0]); > } catch (Exception e) { > // ignore if the parent already exists > if (!(e instanceof KeeperException) || ((KeeperException)e).code() != KeeperException.Code.NODEEXISTS) { > LOG.warn("Unexpected ZK exception when creating parent node /" + parent, e); > } > } > {code} > Every time a new session is created and this {{setContext}} method is called, it attempts to create the root node. I have seen that, even though the root node exists, an create node action is written to the ZK logs. Check first if the node exists before trying to create it. > {code:java} > try { > curatorFramework.delete().forPath(zLock.getPath()); > } catch (InterruptedException ie) { > curatorFramework.delete().forPath(zLock.getPath()); > } > {code} > There has historically been a quite a few bugs regarding leaked locks. The Driver will signal the session {{Thread}} by performing an interrupt. That interrupt can happen any time and it can kill a create/delete action within the ZK framework. We can see one example of workaround for this. If the ZK action is interrupted, simply do it again. Well, what if it's interrupted yet again? The lock will be leaked. Also, when the {{InterruptedException}} is caught in the try block, the thread's interrupted flag is cleared. The flag is not reset in this code and therefore we lose the fact that this thread has been interrupted. This flag should be preserved so that other code paths will know that it's time to exit. > {code:java} > if (tryNum > 1) { > Thread.sleep(sleepTime); > } > unlockPrimitive(hiveLock, parent, curatorFramework); > break; > } catch (Exception e) { > if (tryNum >= numRetriesForUnLock) { > String name = ((ZooKeeperHiveLock)hiveLock).getPath(); > throw new LockException("Node " + name + " can not be deleted after " + numRetriesForUnLock + " attempts.", > e); > } > } > {code} > ... related... the sleep here may be interrupted, but we still need to delete the lock (again, for fear of leaking it). This sleep should be uninterruptible. If we need to get the lock deleted, and there's a problem, interrupting the sleep will cause the code to eventually exit and locks will be leaked. > It also requires a bunch more TLC. -- This message was sent by Atlassian JIRA (v7.6.3#76005)