hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS
Date Fri, 18 Mar 2011 20:52:29 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008624#comment-13008624 ]

Jean-Daniel Cryans commented on HBASE-3668:


I understand that the bug report has a lot of information, but it does answer those questions.
Yes, the region servers were online. Yes, they got regions. But, as I showed, they were unable
to update .META. and eventually failed with NotAllMetaRegionsOnlineException. The master then
saw the RIT timeout and reassigned the regions, which may or may not end up back on one of the
stuck region servers. Eventually they are all opened, but then the balancer runs again. Lather,
rinse, repeat.

> CatalogTracker.waitForMeta can wait forever and totally stall a RS
> ------------------------------------------------------------------
>                 Key: HBASE-3668
>                 URL: https://issues.apache.org/jira/browse/HBASE-3668
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.1
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.2
> We got into a weird situation after hitting HBASE-3664, leaving almost half the region servers unable to open regions (we didn't kill the master, but all RS were restarted).
> Here's the relevant jstack from the region servers:
> {code}
> "regionserver60020-EventThread" daemon prio=10 tid=0x0000000041297800 nid=0x5c5 in Object.wait()
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> 	at java.lang.Object.wait(Native Method)
> 	- waiting on <0x00007fbd4e1ac6d0> (a java.util.concurrent.atomic.AtomicBoolean)
> 	at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:328)
> 	- locked <0x00007fbd4e1ac6d0> (a java.util.concurrent.atomic.AtomicBoolean)
> 	at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:363)
> 	at org.apache.hadoop.hbase.zookeeper.MetaNodeTracker.nodeDeleted(MetaNodeTracker.java:64)
> 	at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeCreated(ZooKeeperNodeTracker.java:154)
> 	- locked <0x00007fbd4d7ab060> (a org.apache.hadoop.hbase.zookeeper.MetaNodeTracker)
> 	at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeDataChanged(ZooKeeperNodeTracker.java:179)
> 	- locked <0x00007fbd4d7ab060> (a org.apache.hadoop.hbase.zookeeper.MetaNodeTracker)
> 	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:268)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> {code}
> Every server in that state cannot receive any further ZK events. See how (reading the stack bottom-up) we first do a nodeDataChanged, then nodeCreated, then nodeDeleted. This is correlated in the logs:
> {code}
> 2011-03-17 17:57:03,636 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020-0x12d627b723e081f Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/1028785192
> 2011-03-17 17:57:03,636 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x12d627b723e081f Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-03-17 17:57:03,637 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x12d627b723e081f Set watcher on existing znode /hbase/unassigned/1028785192
> 2011-03-17 17:57:03,637 INFO org.apache.hadoop.hbase.zookeeper.MetaNodeTracker: Detected completed assignment of META, notifying catalog tracker
> 2011-03-17 17:57:06,988 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Lookedup root region location, connection=org.
> {code}
> Node is updated, then missing, then back again! The reason we're stuck is that CT.waitForMeta will wait forever:
> {code}
>   if (getMetaServerConnection(true) != null) {
>     return metaLocation;
>   }
>   while (!stopped && !metaAvailable.get() &&
>       (timeout == 0 || System.currentTimeMillis() < stop)) {
>     metaAvailable.wait(timeout);
>   }
> {code}
> So when it tried getMetaServerConnection the first time it didn't work, and since the timeout is 0 it then waits forever. Even if it were looping, metaAvailable would never become true, since we're already inside the event thread, thus blocking any other event!
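> To make the deadlock concrete, here is a minimal, self-contained sketch of the same pattern (a hypothetical class for illustration, not the actual HBase code): all ZooKeeper watch callbacks are delivered serially by a single event thread, so a callback that blocks on a condition only another callback can satisfy will never be woken up.
> {code}
> import java.util.concurrent.atomic.AtomicBoolean;
>
> public class EventThreadDeadlock {
>   private final AtomicBoolean metaAvailable = new AtomicBoolean(false);
>
>   // Runs on the single ZK event thread when the META znode disappears.
>   void nodeDeleted() throws InterruptedException {
>     synchronized (metaAvailable) {
>       // Blocks waiting for a notification that can only come from
>       // nodeCreated(), which runs on this very same event thread.
>       while (!metaAvailable.get()) {
>         metaAvailable.wait(0);  // a timeout of 0 means wait forever
>       }
>     }
>   }
>
>   // Would wake the waiter up, but never runs: the event thread that
>   // must deliver this callback is stuck inside nodeDeleted() above.
>   void nodeCreated() {
>     synchronized (metaAvailable) {
>       metaAvailable.set(true);
>       metaAvailable.notifyAll();
>     }
>   }
> }
> {code}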
> This is what happens when the RS then tries to update .META. after opening a region:
> {code}
> 2011-03-17 17:59:06,557 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Interrupting thread Thread[PostOpenDeployTasks:f06b630822a161d0a8fe37481d851d05,5,main]
> 2011-03-17 17:59:06,557 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Exception running postOpenDeployTasks; region=f06b630822a161d0a8fe37481d851d05
> org.apache.hadoop.hbase.NotAllMetaRegionsOnlineException: Interrupted
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:365)
>         at org.apache.hadoop.hbase.catalog.MetaEditor.updateRegionLocation(MetaEditor.java:142)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1359)
> {code}
> The master eventually tries to assign the region somewhere else where it may work, but then the balancer will screw everything up again.
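> A bounded version of the wait loop could look like the following sketch (illustrative only, not the actual patch for this issue); it recomputes the remaining time on every iteration so that neither a zero timeout nor a spurious wakeup can turn into an unbounded wait:
> {code}
> import java.util.concurrent.atomic.AtomicBoolean;
>
> class BoundedMetaWait {
>   private final AtomicBoolean metaAvailable = new AtomicBoolean(false);
>
>   /** Returns true only if meta became available within timeoutMillis. */
>   boolean waitForMeta(long timeoutMillis) throws InterruptedException {
>     final long deadline = System.currentTimeMillis() + timeoutMillis;
>     synchronized (metaAvailable) {
>       while (!metaAvailable.get()) {
>         long remaining = deadline - System.currentTimeMillis();
>         if (remaining <= 0) {
>           return false;  // give up; let the caller retry or fail fast
>         }
>         metaAvailable.wait(remaining);
>       }
>       return true;
>     }
>   }
> }
> {code}
> And crucially, no variant of this wait, bounded or not, should ever be invoked from the ZK event thread itself.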

