hbase-dev mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master
Date Fri, 22 May 2009 19:31:45 GMT

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712226#action_12712226 ]

Jean-Daniel Cryans commented on HBASE-1302:
-------------------------------------------

I actually tried to do the same. I didn't get the "failed to create" exception, but I got this
instead (it never stops):

{code}
2009-05-22 14:59:48,126 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to
report to master for 445473 milliseconds - retrying
2009-05-22 14:59:49,127 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server:
/192.168.1.81:62000. Already tried 0 time(s).
2009-05-22 14:59:50,128 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server:
/192.168.1.81:62000. Already tried 1 time(s).
2009-05-22 14:59:51,129 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server:
/192.168.1.81:62000. Already tried 2 time(s).
2009-05-22 14:59:52,129 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server:
/192.168.1.81:62000. Already tried 3 time(s).
2009-05-22 14:59:53,130 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server:
/192.168.1.81:62000. Already tried 4 time(s).
2009-05-22 14:59:54,131 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server:
/192.168.1.81:62000. Already tried 5 time(s).
2009-05-22 14:59:55,132 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server:
/192.168.1.81:62000. Already tried 6 time(s).
2009-05-22 14:59:56,132 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server:
/192.168.1.81:62000. Already tried 7 time(s).
2009-05-22 14:59:57,133 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server:
/192.168.1.81:62000. Already tried 8 time(s).
2009-05-22 14:59:58,134 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server:
/192.168.1.81:62000. Already tried 9 time(s).
2009-05-22 14:59:58,135 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Exceeded
max retries: 10
{code}

We don't get this forever when the master is restarted on the same node, because the address
held in HRS.hbaseMaster is still the right one. The actual problem is in this code:

{code}
public void process(WatchedEvent event) {
  EventType type = event.getType();
  KeeperState state = event.getState();
  LOG.info("Got ZooKeeper event, state: " + state + ", type: " +
            type + ", path: " + event.getPath());

  // Ignore events if we're shutting down.
  if (stopRequested.get()) {
    LOG.debug("Ignoring ZooKeeper event while shutting down");
    return;
  }

  if (state == KeeperState.Expired) {
    LOG.error("ZooKeeper session expired");
    restart();
  } else if (type == EventType.NodeCreated) {
    getMaster();

    // ZooKeeper watches are one time only, so we need to re-register our watch.
    watchMasterAddress();
  }
}
{code}

I see the node being deleted, but I never see it being created. We don't set a watch after a
NodeDeleted event, though we should: watches are one-time only, so without a new one we will
never know when the master comes back. This should be changed. Instead, we have to set a watch
for when the master node is deleted, and then set a watch on the folder to see when it's
recreated.
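
To illustrate the one-shot watch behaviour described above, here is a minimal sketch. It does not use the real ZooKeeper client: FakeZk, RegionServer, the "/master" path and the host-a/host-b addresses are all hypothetical stand-ins invented for the example. The only thing it tries to show is that the handler has to re-register its watch on every event, including NodeDeleted, or it misses the later NodeCreated.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

class MasterWatchSketch {
    enum EventType { NodeCreated, NodeDeleted }

    /** Hypothetical in-memory store with one-time watches (not the ZooKeeper API). */
    static class FakeZk {
        private final Map<String, String> nodes = new HashMap<>();
        private final Map<String, List<BiConsumer<EventType, String>>> watches = new HashMap<>();

        /** Registers a one-shot watch and returns the node's data, or null if absent. */
        String watch(String path, BiConsumer<EventType, String> watcher) {
            watches.computeIfAbsent(path, k -> new ArrayList<>()).add(watcher);
            return nodes.get(path);
        }

        void create(String path, String data) {
            nodes.put(path, data);
            fire(path, EventType.NodeCreated);
        }

        void delete(String path) {
            nodes.remove(path);
            fire(path, EventType.NodeDeleted);
        }

        /** Removes the watches before calling them, so each one fires at most once. */
        private void fire(String path, EventType type) {
            List<BiConsumer<EventType, String>> fired = watches.remove(path);
            if (fired == null) {
                return;
            }
            for (BiConsumer<EventType, String> w : fired) {
                w.accept(type, path);
            }
        }
    }

    /** Hypothetical region server that tracks the master address. */
    static class RegionServer {
        private final FakeZk zk;
        private final String masterPath;
        String masterAddress;

        RegionServer(FakeZk zk, String masterPath) {
            this.zk = zk;
            this.masterPath = masterPath;
            this.masterAddress = zk.watch(masterPath, this::process);
        }

        void process(EventType type, String path) {
            // Re-register on EVERY event type; skipping this after NodeDeleted
            // is what leaves the region server blind to the new master.
            String current = zk.watch(masterPath, this::process);
            if (type == EventType.NodeCreated) {
                masterAddress = current;
            } else {
                masterAddress = null; // master is gone until the node reappears
            }
        }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        zk.create("/master", "host-a:62000");
        RegionServer rs = new RegionServer(zk, "/master");
        zk.delete("/master");
        zk.create("/master", "host-b:62000");
        System.out.println("master is now " + rs.masterAddress);
    }
}
```

If the re-registration line is moved inside the NodeCreated branch, as in the handler quoted above, the region server in this toy model never learns about the second create. With the real client, the equivalent would presumably be to set the watch again from the NodeDeleted branch as well.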

> When a new master comes up, regionservers should continue with their region assignments from the last master
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1302
>                 URL: https://issues.apache.org/jira/browse/HBASE-1302
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.20.0
>            Reporter: Nitay Joffe
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.20.0
>
>         Attachments: hbase-1302-v1.patch, hbase-1302-v2.patch
>
>
> After HBASE-1205, we can now handle a master going down and coming up somewhere else.
> When this happens, the new master will scan everything and reassign all the regions, which
> is not ideal. Instead of doing that, we should keep the region assignments from the last master.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

