Date: Mon, 3 Dec 2012 10:53:58 +0000 (UTC)
From: "liwei (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-7259) Deadlock in HBaseClient when KeeperException occured

     [ https://issues.apache.org/jira/browse/HBASE-7259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liwei updated HBASE-7259:
-------------------------
    Attachment: HConnectionManager.patch

> Deadlock in HBaseClient when KeeperException occured
> ----------------------------------------------------
>
>                 Key: HBASE-7259
>                 URL: https://issues.apache.org/jira/browse/HBASE-7259
>             Project: HBase
>          Issue Type: Bug
>          Components: Zookeeper
>    Affects Versions: 0.94.0, 0.94.1, 0.94.2
>            Reporter: liwei
>            Priority: Critical
>         Attachments: HConnectionManager.patch
>
>
> After HBaseClient had been running for a period of time, all get
> operations became too slow.
> From the client logs I could see the following:
> 1. Unable to get data of znode /hbase/root-region-server
> java.lang.InterruptedException
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:485)
>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1253)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1129)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:264)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:522)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:498)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:156)
>         at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:832)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
>         at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
>         at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
>         at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:150)
>         at org.apache.hadoop.hbase.client.MetaScanner.access$000(MetaScanner.java:48)
>         at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:126)
>         at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:123)
>         at org.apache.hadoop.hbase.client.HConnectionManager.execute(HConnectionManager.java:359)
>         at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123)
>         at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
>         at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
>         at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
>         at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
> 2. Catalina.out found one Java-level deadlock:
> =============================
> "catalina-exec-800":
>   waiting to lock monitor 0x000000005f1f6530 (object 0x0000000731902200, a java.lang.Object),
>   which is held by "catalina-exec-710"
> "catalina-exec-710":
>   waiting to lock monitor 0x00002aaab9a05bd0 (object 0x00000007321f8708, a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation),
>   which is held by "catalina-exec-29-EventThread"
> "catalina-exec-29-EventThread":
>   waiting to lock monitor 0x000000005f9f0af0 (object 0x0000000732a9c7e0, a org.apache.hadoop.hbase.zookeeper.RootRegionTracker),
>   which is held by "catalina-exec-710"
> Java stack information for the threads listed above:
> ===================================================
> "catalina-exec-800":
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:943)
>         - waiting to lock <0x0000000731902200> (a java.lang.Object)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
>         at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
>         at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
>         at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
> "catalina-exec-710":
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:599)
>         - waiting to lock <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:158)
>         - locked <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
>         at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:832)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
>         at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
>         at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
>         at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:150)
>         at org.apache.hadoop.hbase.client.MetaScanner.access$000(MetaScanner.java:48)
>         at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:126)
>         at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:123)
>         at org.apache.hadoop.hbase.client.HConnectionManager.execute(HConnectionManager.java:359)
>         at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123)
>         at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948)
>         - locked <0x0000000731902200> (a java.lang.Object)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
>         at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
>         at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
>         at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
> "catalina-exec-29-EventThread":
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.stop(ZooKeeperNodeTracker.java:98)
>         - waiting to lock <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:604)
>         - locked <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:374)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:271)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
> Found 1 deadlock.
> From the source code, the cause of this problem is that ZooKeeperNodeTracker.getData runs into a KeeperException and then tries to call resetZooKeeperTrackers. At the same time, ClientCnxn.EventThread also calls resetZooKeeperTrackers.
> Because getData already holds the lock on the ZooKeeperNodeTracker, the two threads acquire the tracker lock and the connection lock in opposite order, so a deadlock occurs.
> To avoid the problem, we can add an "if resetting" condition in abortable.abort().
> See the patch.
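
The lock-order inversion described in the report, and the idea of guarding abort() with a "resetting" check, can be sketched in isolation. This is only an illustration under assumptions, not the content of HConnectionManager.patch: the classes Tracker and Connection, the method getDataFailing, and the latches are hypothetical stand-ins that mimic the two monitors seen in the thread dump. One thread holds the tracker monitor and calls abort(), while a second thread is already inside the synchronized reset and needs the tracker monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins; none of these types are HBase classes.
class GuardedResetSketch {

    /** Mimics a node tracker whose getData() and stop() share one monitor. */
    static class Tracker {
        final CountDownLatch monitorHeld = new CountDownLatch(1);
        private final Connection owner;

        Tracker(Connection owner) { this.owner = owner; }

        // Simulates getData() hitting an error while holding the tracker monitor.
        synchronized void getDataFailing() throws InterruptedException {
            monitorHeld.countDown();      // tell the other thread the monitor is taken
            owner.insideReset.await();    // wait until the other thread is inside the reset
            owner.abort();                // error path: ask the connection to reset
        }

        synchronized void stop() { }
    }

    /** Mimics the connection whose reset is synchronized and stops the tracker. */
    static class Connection {
        final CountDownLatch insideReset = new CountDownLatch(1);
        final AtomicBoolean resetting = new AtomicBoolean(false);
        final AtomicInteger resets = new AtomicInteger();
        final Tracker tracker = new Tracker(this);

        void abort() {
            // Guard: if a reset is already in progress, do not queue up behind it.
            if (!resetting.compareAndSet(false, true)) {
                return;
            }
            try {
                resetTrackers();
            } finally {
                resetting.set(false);
            }
        }

        private synchronized void resetTrackers() {
            resets.incrementAndGet();
            insideReset.countDown();
            tracker.stop();               // blocks until the tracker monitor is free
        }
    }

    /** Returns the number of resets performed, or -1 if the threads did not finish. */
    static int runScenario() {
        Connection conn = new Connection();
        Thread worker = new Thread(() -> {
            try {
                conn.tracker.getDataFailing();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "worker");
        Thread event = new Thread(() -> {
            try {
                conn.tracker.monitorHeld.await();
                conn.abort();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "event-thread");
        worker.setDaemon(true);
        event.setDaemon(true);
        worker.start();
        event.start();
        try {
            worker.join(5000);
            event.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
        return (worker.isAlive() || event.isAlive()) ? -1 : conn.resets.get();
    }

    public static void main(String[] args) {
        System.out.println("resets performed: " + runScenario());
    }
}
```

In this sketch the worker's abort() returns immediately because the event thread has already set the flag, so the worker leaves the synchronized method, releases the tracker monitor, and the event thread's stop() call can proceed. Without the compareAndSet guard the worker would block on the connection monitor while still holding the tracker monitor, which is the cycle shown in the dump. Whether the real patch uses an atomic flag or some other check is not stated in this message.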