hbase-issues mailing list archives

From "jiraposter@reviews.apache.org (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-3867) when cluster is stopped and server which hosted meta region is removed from cluster, master breaks down after restarting cluster.
Date Wed, 06 Jul 2011 22:09:21 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060871#comment-13060871 ]

jiraposter@reviews.apache.org commented on HBASE-3867:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1019/
-----------------------------------------------------------

(Updated 2011-07-06 22:07:37.948275)


Review request for hbase.


Changes
-------

Removed additional call to master getting servers in the cluster.


Summary
-------

When the cluster is stopped, the server that hosted the meta region is removed
from the cluster, and the cluster is then restarted,
getCachedConnection() throws "NoRouteToHostException".

NoRouteToHostException is now caught and handled the same way as SocketTimeoutException.

If an IOException still goes uncaught, we ask the Master for the list of servers and obtain
a region connection from one of them.
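
As a rough illustration of the exception handling described above, here is a
minimal, self-contained sketch. It is not the CatalogTracker patch itself: the
class CachedConnectionSketch, the Connector interface, and the choice to return
null for an unreachable server are all assumptions made for this example. Only
the two exception types come from the summary.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;

// Hypothetical sketch, not the actual CatalogTracker code.
final class CachedConnectionSketch {

  /** Hypothetical stand-in for opening a connection to a region server. */
  interface Connector {
    String connect(String address) throws IOException;
  }

  /**
   * Returns the connection, or null when the server cannot be reached.
   * Returning null is an assumption of this sketch; any other IOException
   * still propagates to the caller.
   */
  static String getCachedConnection(Connector connector, String address)
      throws IOException {
    try {
      return connector.connect(address);
    } catch (SocketTimeoutException e) {
      // The server did not answer in time: report "no connection".
      return null;
    } catch (NoRouteToHostException e) {
      // The host is gone (e.g. removed from the cluster): handle it the same way.
      return null;
    }
  }

  public static void main(String[] args) throws IOException {
    // Simulates a meta location that points at a host no longer in the cluster.
    Connector removedHost = address -> {
      throw new NoRouteToHostException("No route to host: " + address);
    };
    System.out.println(getCachedConnection(removedHost, "old-meta-host:60020"));
  }
}
```

With handling along these lines, a caller such as verifyMetaRegionLocation()
would see a missing connection rather than an exception, so the master could go
on to reassign the meta region instead of aborting.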


This addresses bug HBASE-3867.
    https://issues.apache.org/jira/browse/HBASE-3867


Diffs (updated)
-----

  /src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1143525 

Diff: https://reviews.apache.org/r/1019/diff


Testing
-------

Ran test suite.


Thanks,

Ted



> when cluster is stopped and server which hosted meta region is removed from cluster, master breaks down after restarting cluster.
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3867
>                 URL: https://issues.apache.org/jira/browse/HBASE-3867
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.1, 0.90.2
>            Reporter: Liu Jia
>            Priority: Critical
>             Fix For: 0.90.2
>
>         Attachments: 3867-trunk-v2.txt, 3867-trunk-v3.txt, CatalogTracker.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When the cluster is stopped, the server that hosted the meta region is removed from the
> cluster, and the cluster is then restarted, the following code throws "NoRouteToHostException":
> package org.apache.hadoop.hbase.catalog;
> public class CatalogTracker 
>  private HRegionInterface getMetaServerConnection(boolean refresh)
>   throws IOException, InterruptedException {
>     synchronized (metaAvailable) {
>       if (metaAvailable.get()) {
>         HRegionInterface current = getCachedConnection(metaLocation);
>         if (!refresh) {
>           return current;
>         }
>         if (verifyRegionLocation(current, this.metaLocation, META_REGION)) {
>           return current;
>         }
>         resetMetaLocation();
>       }
>       HRegionInterface rootConnection = getRootServerConnection();
>       if (rootConnection == null) {
>         return null;
>       }
>       HServerAddress newLocation = MetaReader.readMetaLocation(rootConnection);
>       if (newLocation == null) {
>         return null;
>       }
>       // The following line throws the exception.
>       HRegionInterface newConnection = getCachedConnection(newLocation);
>       if (verifyRegionLocation(newConnection, this.metaLocation, META_REGION)) {
>         setMetaLocation(newLocation);
>         return newConnection;
>       }
>       return null;
>     }
>   }
> // The following method does not handle the exception.
> public class CatalogTracker 
>   public boolean verifyMetaRegionLocation(final long timeout)
>   throws InterruptedException, IOException {
>     return getMetaServerConnection(true) != null;
>   }
> // The master calls the CatalogTracker method and does not handle the exception either.
> package org.apache.hadoop.hbase.master;
> public class HMaster
> int assignRootAndMeta()
>   throws InterruptedException, IOException, KeeperException {
>     int assigned = 0;
>     long timeout = this.conf.getLong("hbase.catalog.verification.timeout", 1000);
>     // Work on ROOT region.  Is it in zk in transition?
>     boolean rit = this.assignmentManager.
>       processRegionInTransitionAndBlockUntilAssigned(HRegionInfo.ROOT_REGIONINFO);
>     if (!catalogTracker.verifyRootRegionLocation(timeout)) {
>       this.assignmentManager.assignRoot();
>       this.catalogTracker.waitForRoot();
>       assigned++;
>     }
>     LOG.info("-ROOT- assigned=" + assigned + ", rit=" + rit +
>       ", location=" + catalogTracker.getRootLocation());
>     // Work on meta region
>     rit = this.assignmentManager.
>       processRegionInTransitionAndBlockUntilAssigned(HRegionInfo.FIRST_META_REGIONINFO);
>     // When the cluster is restarted, the master breaks down here.
>     if (!this.catalogTracker.verifyMetaRegionLocation(timeout)) {
>       this.assignmentManager.assignMeta();
>       this.catalogTracker.waitForMeta();
>       // Above check waits for general meta availability but this does not
>       // guarantee that the transition has completed
>       this.assignmentManager.waitForAssignment(HRegionInfo.FIRST_META_REGIONINFO);
>       assigned++;
>     }
>     LOG.info(".META. assigned=" + assigned + ", rit=" + rit +
>       ", location=" + catalogTracker.getMetaLocation());
>     return assigned;
>   }
> Thanks to JunQiang Yuan at www.alipay.com for providing information about this bug.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
