hbase-dev mailing list archives

From "Nitay Joffe (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-1232) zookeeper client wont reconnect if there is a problem
Date Tue, 24 Mar 2009 06:01:51 GMT

    [ https://issues.apache.org/jira/browse/HBASE-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688578#action_12688578 ]

Nitay Joffe commented on HBASE-1232:
------------------------------------

When a session expires (SessionExpired), ZooKeeper deletes our ephemeral nodes. This means
everyone else in the cluster will think that server is down. To recover, we need to restart
the server completely.

For example, if the master's connection to ZooKeeper throws SessionExpired, the master loses its
ephemeral address node in ZooKeeper and everyone will think the master has died. In fact, another
master may come up in its place, now that we have the HA master lock.

I'm writing the #restart() methods for HMaster and HRegionServer. Effectively it's just something
like:

{code}
  shutdown();
  run();
{code}

I notice that the shutdown/stop methods in those classes just set a flag, which is picked up
later and causes a shutdown. How do I make sure the server has actually shut down between the
shutdown() call and the run() call?
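
One possible approach, sketched under assumptions: keep a handle on the thread that runs the
main loop and join it after setting the stop flag, so the next run only starts once the old loop
has returned. RestartableServer, mainLoop() and awaitStopped() are hypothetical names used for
illustration, not HMaster or HRegionServer code. The sketch also assumes restart() is called
from a thread other than the main loop, since a thread cannot join itself.

```java
// Hypothetical stand-in for HMaster/HRegionServer, for illustration only.
class RestartableServer {
  // Same idea as the existing stop flag: set here, noticed later by the loop.
  private volatile boolean stopRequested;
  private Thread worker;
  private int starts;

  synchronized void start() {
    stopRequested = false;
    starts++;
    worker = new Thread(this::mainLoop, "server-main-" + starts);
    worker.start();
  }

  // Only requests a stop; returns before the loop has actually exited.
  void shutdown() {
    stopRequested = true;
  }

  // Blocks until the current main loop thread has terminated.
  void awaitStopped() {
    Thread t;
    synchronized (this) {
      t = worker;
    }
    if (t == null) {
      return;
    }
    try {
      t.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for shutdown", e);
    }
  }

  // Must not be called from the main loop thread itself.
  synchronized void restart() {
    shutdown();
    awaitStopped();
    start();
  }

  synchronized boolean isRunning() {
    return worker != null && worker.isAlive();
  }

  synchronized int starts() {
    return starts;
  }

  private void mainLoop() {
    try {
      while (!stopRequested) {
        Thread.sleep(5); // placeholder for real work
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    // Real cleanup would go here, before the thread exits.
  }
}
```

Joining the thread is only one option. If the restart has to be triggered from inside the main
loop itself, an outer loop that re-enters run() after it returns would avoid the self-join problem.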

> zookeeper client wont reconnect if there is a problem
> -----------------------------------------------------
>
>                 Key: HBASE-1232
>                 URL: https://issues.apache.org/jira/browse/HBASE-1232
>             Project: Hadoop HBase
>          Issue Type: Bug
>         Environment: java 1.7, zookeeper 3.0.1
>            Reporter: ryan rawson
>            Assignee: Nitay Joffe
>            Priority: Critical
>             Fix For: 0.20.0
>
>
> my regionserver got wedged:
> 2009-03-02 15:43:30,938 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to create /hbase:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:87)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:35)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:482)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:219)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:240)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:328)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:783)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:468)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:443)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:518)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:477)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:450)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocation(HConnectionManager.java:295)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocationForRowWithRetries(HConnectionManager.java:919)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:950)
>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1370)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1314)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1294)
>         at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:237)
>         at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:216)
>         at org.apache.hadoop.hbase.RegionHistorian.addRegionSplit(RegionHistorian.java:174)
>         at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:607)
>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:174)
>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:107)
> this message repeats over and over.  
> Looking at the code in question:
>   private boolean ensureExists(final String znode) {
>     try {
>       zooKeeper.create(znode, new byte[0],
>                        Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
>       LOG.debug("Created ZNode " + znode);
>       return true;
>     } catch (KeeperException.NodeExistsException e) {
>       return true;      // ok, move on.
>     } catch (KeeperException.NoNodeException e) {
>       return ensureParentExists(znode) && ensureExists(znode);
>     } catch (KeeperException e) {
>       LOG.warn("Failed to create " + znode + ":", e);
>     } catch (InterruptedException e) {
>       LOG.warn("Failed to create " + znode + ":", e);
>     }
>     return false;
>   }
> We need to catch this exception specifically and reopen the ZK connection.
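
A minimal sketch of the "catch it specifically and reopen" idea from the description above. To
stay self-contained it uses hypothetical stand-ins (ZkClient, FakeSessionExpiredException,
ReconnectingWrapper) rather than the real ZooKeeper classes. In real code the catch would be on
KeeperException.SessionExpiredException and the factory would construct a new ZooKeeper handle.
Reopening only yields a new session: ephemeral nodes from the expired session are gone and would
have to be re-created.

```java
import java.util.function.Supplier;

// Stand-in for KeeperException.SessionExpiredException (assumption for this sketch).
class FakeSessionExpiredException extends Exception {
}

// Stand-in for the small part of a ZooKeeper handle that the sketch needs.
interface ZkClient {
  void create(String znode) throws FakeSessionExpiredException;

  void close();
}

// Hypothetical wrapper that reopens the connection once when the session has expired.
class ReconnectingWrapper {
  private final Supplier<ZkClient> factory;
  private ZkClient client;
  private int reconnects;

  ReconnectingWrapper(Supplier<ZkClient> factory) {
    this.factory = factory;
    this.client = factory.get();
  }

  // Tries to create the znode; on session expiry, reopens and retries exactly once.
  boolean ensureExists(String znode) {
    for (int attempt = 0; attempt < 2; attempt++) {
      try {
        client.create(znode);
        return true;
      } catch (FakeSessionExpiredException e) {
        // The old handle is permanently dead, so discard it and open a fresh one.
        client.close();
        client = factory.get();
        reconnects++;
      }
    }
    return false; // still expired after one reconnect; give up instead of looping
  }

  int reconnects() {
    return reconnects;
  }
}
```

Bounding the retry to a single reconnect is a design choice of this sketch, meant to avoid the
endless repeat seen in the log above.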

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.