Mailing-List: contact hbase-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hbase-dev@hadoop.apache.org
Message-ID: <1703749895.1243020705548.JavaMail.jira@brutus>
Date: Fri, 22 May 2009 12:31:45 -0700 (PDT)
From: "Jean-Daniel Cryans (JIRA)" <jira@apache.org>
To: hbase-dev@hadoop.apache.org
Subject: [jira] Commented: (HBASE-1302) When a new master comes up,
 regionservers should continue with their region assignments from the last
 master
In-Reply-To: <764955780.1238522451113.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712226#action_12712226 ] 

Jean-Daniel Cryans commented on HBASE-1302:
-------------------------------------------

I tried the same thing. I didn't get the "failed to create" exception, but I got this instead (it never stops):

{code}
2009-05-22 14:59:48,126 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for 445473 milliseconds - retrying
2009-05-22 14:59:49,127 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 0 time(s).
2009-05-22 14:59:50,128 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 1 time(s).
2009-05-22 14:59:51,129 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 2 time(s).
2009-05-22 14:59:52,129 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 3 time(s).
2009-05-22 14:59:53,130 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 4 time(s).
2009-05-22 14:59:54,131 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 5 time(s).
2009-05-22 14:59:55,132 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 6 time(s).
2009-05-22 14:59:56,132 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 7 time(s).
2009-05-22 14:59:57,133 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 8 time(s).
2009-05-22 14:59:58,134 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 9 time(s).
2009-05-22 14:59:58,135 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Exceeded max retries: 10
{code}

This doesn't go on forever when the master is restarted on the same node, because HRS.hbaseMaster still points at the right address. The actual problem is in this code:

{code}
  public void process(WatchedEvent event) {
    EventType type = event.getType();
    KeeperState state = event.getState();
    LOG.info("Got ZooKeeper event, state: " + state + ", type: " +
              type + ", path: " + event.getPath());

    // Ignore events if we're shutting down.
    if (stopRequested.get()) {
      LOG.debug("Ignoring ZooKeeper event while shutting down");
      return;
    }

    if (state == KeeperState.Expired) {
      LOG.error("ZooKeeper session expired");
      restart();
    } else if (type == EventType.NodeCreated) {
      getMaster();

      // ZooKeeper watches are one time only, so we need to re-register our watch.
      watchMasterAddress();
    }
  }
{code}

I see that the node is deleted, but I never see it being created: we don't set a watch after a NodeDeleted event, so we never learn when the master comes back. This should be changed. We have to set a watch when the master node is deleted, and then set a watch on the folder to see when it's recreated.
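
To illustrate why the missing re-registration matters, here is a minimal, self-contained sketch. It does not use the real ZooKeeper client: FakeZk, Watcher, RegionServer and the host names are hypothetical stand-ins that only model one-shot watches on a single master node. It also re-arms a watch on the node itself rather than on the parent folder, which is a simplification of what is proposed above, not the actual patch.

```java
import java.util.Objects;

/** Sketch only: every type here is a hypothetical stand-in, not the ZooKeeper API. */
final class MasterWatchSketch {

  enum EventType { NodeCreated, NodeDeleted }

  interface Watcher {
    void process(EventType type);
  }

  /** Models a single master node with one-shot watch semantics. */
  static final class FakeZk {
    private String masterAddress; // null while the master node is absent
    private Watcher watch;        // cleared as soon as it fires

    void watchMasterAddress(Watcher w) {
      watch = Objects.requireNonNull(w);
    }

    String readMasterAddress() {
      return masterAddress;
    }

    void createMaster(String address) {
      masterAddress = address;
      fire(EventType.NodeCreated);
    }

    void deleteMaster() {
      masterAddress = null;
      fire(EventType.NodeDeleted);
    }

    private void fire(EventType type) {
      Watcher w = watch;
      watch = null; // one-shot: the watcher must re-register to see later events
      if (w != null) {
        w.process(type);
      }
    }
  }

  /** Region server side: re-arms the watch after every event, including NodeDeleted. */
  static final class RegionServer implements Watcher {
    private final FakeZk zk;
    private String master;

    RegionServer(FakeZk zk) {
      this.zk = zk;
    }

    void start() {
      master = zk.readMasterAddress();
      zk.watchMasterAddress(this);
    }

    String currentMaster() {
      return master;
    }

    @Override
    public void process(EventType type) {
      if (type == EventType.NodeCreated) {
        master = zk.readMasterAddress();
      } else if (type == EventType.NodeDeleted) {
        master = null; // master is gone; wait for the next NodeCreated
      }
      // Without this call on the NodeDeleted path, the later NodeCreated is never seen.
      zk.watchMasterAddress(this);
    }
  }

  public static void main(String[] args) {
    FakeZk zk = new FakeZk();
    zk.createMaster("host-a:60000");
    RegionServer rs = new RegionServer(zk);
    rs.start();
    zk.deleteMaster();               // old master goes away
    zk.createMaster("host-b:60000"); // new master comes up elsewhere
    System.out.println("master is now " + rs.currentMaster());
  }
}
```

If the watchMasterAddress call is moved back inside the NodeCreated branch only, the region server in this sketch keeps a null master after the second createMaster, which matches the behaviour described above.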

> When a new master comes up, regionservers should continue with their region assignments from the last master
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1302
>                 URL: https://issues.apache.org/jira/browse/HBASE-1302
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.20.0
>            Reporter: Nitay Joffe
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.20.0
>
>         Attachments: hbase-1302-v1.patch, hbase-1302-v2.patch
>
>
> After HBASE-1205, we can now handle a master going down and coming up somewhere else. When this happens, the new master will scan everything and reassign all the regions, which is not ideal. Instead of doing that, we should keep the region assignments from the last master. 
