hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-3047) If new master crashes, restart is messy
Date Tue, 28 Sep 2010 21:42:32 GMT

     [ https://issues.apache.org/jira/browse/HBASE-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3047:

    Attachment: 3047.txt

M src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java
  Add test of the case where the HRegionInterface connection throws a
  ConnectionException.  Also tests the two new methods for verifying
  root and meta locations added to CatalogTracker.
M src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
  Change order in which we start up trackers in ZK.  Also add blocking
  until master is up to make it less likely we'll start before master
  comes up, especially around the cluster start up situation.
M src/main/java/org/apache/hadoop/hbase/master/HMaster.java
  Introduce a new state on startup: the case where the cluster is
  NOT a fresh startup and it's NOT a cluster where all is fully
  assigned.  The repair the master needs to run to fix up this new
  state is not yet done; we throw a NotImplementedException for
  now.  TODO.  Added new isRunningCluster checker used to figure out
  what the cluster condition is when the master is joining.  Not
  comprehensive but good enough for now.
M src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
  Added new verifyRootRegionLocation and verifyMetaRegionLocation.
  Needed to verify what's in zk is actually locations of catalog
M src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java
  Add fact that the verifying method, getRegionInfo, can throw
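
The three startup conditions described in the HMaster notes above can be sketched roughly as follows. This is an illustration only, not the code in 3047.txt: the names ClusterStartupState, StartupClassifier, classify and planFor are hypothetical, the boolean inputs stand in for whatever isRunningCluster and the verify methods actually check, and UnsupportedOperationException stands in for the NotImplementedException the patch throws.

```java
/** Hypothetical names for the conditions a joining master might distinguish. */
enum ClusterStartupState {
    FRESH_STARTUP,   // nothing was running before; assign everything from scratch
    FULLY_ASSIGNED,  // running cluster with verified catalog regions; process failover
    NEEDS_REPAIR     // neither of the above; the new state, repair not yet written
}

final class StartupClassifier {
    private StartupClassifier() {}

    /**
     * Classify the cluster condition seen by a master that is joining.
     * All three inputs are assumptions made for this sketch.
     */
    static ClusterStartupState classify(boolean runningCluster,
                                        boolean rootVerified,
                                        boolean metaVerified) {
        if (!runningCluster) {
            return ClusterStartupState.FRESH_STARTUP;
        }
        if (rootVerified && metaVerified) {
            return ClusterStartupState.FULLY_ASSIGNED;
        }
        return ClusterStartupState.NEEDS_REPAIR;
    }

    /** Describe what the master would do next; the repair path is a TODO. */
    static String planFor(ClusterStartupState state) {
        switch (state) {
            case FRESH_STARTUP:
                return "assign root, meta and all user regions";
            case FULLY_ASSIGNED:
                return "process failover by scanning meta";
            default:
                // Stand-in for the NotImplementedException mentioned in the notes.
                throw new UnsupportedOperationException("repair not implemented");
        }
    }
}
```

With this shape, a cluster that is running but whose meta location fails verification lands in NEEDS_REPAIR instead of being treated as a plain failover.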

> If new master crashes, restart is messy
> ---------------------------------------
>                 Key: HBASE-3047
>                 URL: https://issues.apache.org/jira/browse/HBASE-3047
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 0.90.0
>         Attachments: 3047.txt
> If master crashes, the cluster-is-up flag is left stuck on.
> On restart of cluster, regionservers may come up before the master.  They'll have registered
> themselves in zk by the time the master assumes its role, and the master will think it's joining an
> up-and-running cluster when in fact this is a fresh startup.  Other problems are that there'll
> be a root region location up in zk that is bad.  Same for meta, and at the moment we're not handling bad
> root and meta very well.
> Here's a sample of the kind of issues we're running into:
> {code}
> 2010-09-25 23:53:13,938 FATAL org.apache.hadoop.hbase.master.HMaster:
> Unhandled exception. Starting shutdown.
> java.io.IOException: Call to / failed on local
> exception: java.io.IOException: Connection reset by peer
>    at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:781)
>    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:255)
>    at $Proxy1.getProtocolVersion(Unknown Source)
>    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:412)
>    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:388)
>    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:435)
>    at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:345)
>    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:889)
>    at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:350)
>    at org.apache.hadoop.hbase.catalog.CatalogTracker.getRootServerConnection(CatalogTracker.java:209)
>    at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:241)
>    at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:286)
>    at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:326)
>    at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:157)
>    at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:140)
>    at org.apache.hadoop.hbase.master.AssignmentManager.rebuildUserRegions(AssignmentManager.java:753)
>    at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:174)
>    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:314)
> Caused by: java.io.IOException: Connection reset by peer
>    at sun.nio.ch.FileDispatcher.read0(Native Method)
>    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>    at sun.nio.ch.IOUtil.read(IOUtil.java:206)
> {code}
> Notice, we think it's a case of processFailover, so we think we can just scan meta to fix up
> our in-memory picture of the running cluster, only the scan of meta fails because meta
> is not assigned.
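
One way to avoid the failure in the trace above is to probe the recorded catalog location before scanning it, treating any connection error as "location is stale". The sketch below is a guess at that idea, not the patch itself: RegionProbe and CatalogLocationVerifier are hypothetical names, and the probe's getRegionInfo is a simplified stand-in that returns a region name as a String, not the real HRegionInterface method.

```java
import java.io.IOException;

/** Hypothetical stand-in for the remote call used to probe a catalog region's server. */
interface RegionProbe {
    /** Returns the name of the region being served, or throws if the server is unreachable. */
    String getRegionInfo() throws IOException;
}

final class CatalogLocationVerifier {
    private CatalogLocationVerifier() {}

    /**
     * Returns true only if the server recorded for the catalog region answers
     * and reports the expected region name.
     */
    static boolean verifyLocation(RegionProbe probe, String expectedRegionName) {
        if (probe == null) {
            return false; // no location recorded at all
        }
        try {
            return expectedRegionName.equals(probe.getRegionInfo());
        } catch (IOException e) {
            // Connection refused or reset: the recorded location is stale.
            return false;
        }
    }
}
```

A caller would run this check on the root and meta locations first and only go down the failover path when both come back true.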

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
