Return-Path: Delivered-To: apmail-hbase-issues-archive@www.apache.org Received: (qmail 4770 invoked from network); 22 Mar 2011 22:04:43 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 22 Mar 2011 22:04:43 -0000 Received: (qmail 42736 invoked by uid 500); 22 Mar 2011 22:04:43 -0000 Delivered-To: apmail-hbase-issues-archive@hbase.apache.org Received: (qmail 42717 invoked by uid 500); 22 Mar 2011 22:04:43 -0000 Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list issues@hbase.apache.org Received: (qmail 42708 invoked by uid 99); 22 Mar 2011 22:04:43 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 22 Mar 2011 22:04:43 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED,T_RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 22 Mar 2011 22:04:42 +0000 Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116]) by hel.zones.apache.org (Postfix) with ESMTP id B26AE42A29 for ; Tue, 22 Mar 2011 22:04:05 +0000 (UTC) Date: Tue, 22 Mar 2011 22:04:05 +0000 (UTC) From: "Jean-Daniel Cryans (JIRA)" To: issues@hbase.apache.org Message-ID: <1545343849.4779.1300831445727.JavaMail.tomcat@hel.zones.apache.org> In-Reply-To: <727551499.10295.1300399109609.JavaMail.tomcat@hel.zones.apache.org> Subject: [jira] [Commented] (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009864#comment-13009864 ] Jean-Daniel Cryans commented on HBASE-3668: ------------------------------------------- Will commit with metaAvailable.wait(timeout == 0 ? 50 : timeout); > CatalogTracker.waitForMeta can wait forever and totally stall a RS > ------------------------------------------------------------------ > > Key: HBASE-3668 > URL: https://issues.apache.org/jira/browse/HBASE-3668 > Project: HBase > Issue Type: Bug > Affects Versions: 0.90.1 > Reporter: Jean-Daniel Cryans > Assignee: Jean-Daniel Cryans > Priority: Blocker > Fix For: 0.90.2 > > Attachments: HBASE-3668.patch > > > We got into a weird situation after hitting HBASE-3664 leaving almost half the region servers unable to open regions (we didn't kill the master but all RS were restarted). > Here's the relevant jstack from the region servers: > {code} > "regionserver60020-EventThread" daemon prio=10 tid=0x0000000041297800 nid=0x5c5 in Object.wait() [0x00007fbd2b7f6000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x00007fbd4e1ac6d0> (a java.util.concurrent.atomic.AtomicBoolean) > at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:328) > - locked <0x00007fbd4e1ac6d0> (a java.util.concurrent.atomic.AtomicBoolean) > at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:363) > at org.apache.hadoop.hbase.zookeeper.MetaNodeTracker.nodeDeleted(MetaNodeTracker.java:64) > at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeCreated(ZooKeeperNodeTracker.java:154) > - locked <0x00007fbd4d7ab060> (a org.apache.hadoop.hbase.zookeeper.MetaNodeTracker) > at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeDataChanged(ZooKeeperNodeTracker.java:179) > - locked <0x00007fbd4d7ab060> (a org.apache.hadoop.hbase.zookeeper.MetaNodeTracker) > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:268) > at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) > {code} > Every server in that state cannot receive any ZK event. See how we first do a nodeDataChanged, then nodeCreated, then nodeDeleted. This is correlated in the logs: > {code} > 2011-03-17 17:57:03,636 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020-0x12d627b723e081f Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/1028785192 > 2011-03-17 17:57:03,636 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x12d627b723e081f Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error) > 2011-03-17 17:57:03,637 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x12d627b723e081f Set watcher on existing znode /hbase/unassigned/1028785192 > 2011-03-17 17:57:03,637 INFO org.apache.hadoop.hbase.zookeeper.MetaNodeTracker: Detected completed assignment of META, notifying catalog tracker > 2011-03-17 17:57:06,988 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Lookedup root region location, connection=org. > {code} > Node is updated, then is missing, than came back! The reason we're stuck is that CT.waitForMeta will wait forever: > {code} > if (getMetaServerConnection(true) != null) { > return metaLocation; > } > while(!stopped && !metaAvailable.get() && > (timeout == 0 || System.currentTimeMillis() < stop)) { > metaAvailable.wait(timeout); > } > {code} > So when it tried getMetaServerConnection the first time it didn't work, and then since the timeout is 0 then it waits. Even if it was looping, metaAvailable will never be true since we're already inside the event thread thus blocking any other event! > This is what happens when the RS then tries to update .META. after opening a region: > {code} > 2011-03-17 17:59:06,557 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Interrupting thread Thread[PostOpenDeployTasks:f06b630822a161d0a8fe37481d851d05,5,main] > 2011-03-17 17:59:06,557 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Exception running postOpenDeployTasks; region=f06b630822a161d0a8fe37481d851d05 > org.apache.hadoop.hbase.NotAllMetaRegionsOnlineException: Interrupted > at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:365) > at org.apache.hadoop.hbase.catalog.MetaEditor.updateRegionLocation(MetaEditor.java:142) > at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1359) > {code} > The master eventually tries to assign the region somewhere else where it may work, but then the balancer will screw everything up again. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira