From: "Jean-Daniel Cryans (JIRA)"
To: hbase-dev@hadoop.apache.org
Reply-To: hbase-dev@hadoop.apache.org
Date: Fri, 22 May 2009 12:31:45 -0700 (PDT)
Subject: [jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master
Message-ID: <1703749895.1243020705548.JavaMail.jira@brutus>
In-Reply-To: <764955780.1238522451113.JavaMail.jira@brutus>

[ 
https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712226#action_12712226 ]

Jean-Daniel Cryans commented on HBASE-1302:
-------------------------------------------

I tried the same thing. I didn't get the "failed to create" exception, but I got this instead, and it never stops:

{code}
2009-05-22 14:59:48,126 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for 445473 milliseconds - retrying
2009-05-22 14:59:49,127 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 0 time(s).
2009-05-22 14:59:50,128 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 1 time(s).
2009-05-22 14:59:51,129 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 2 time(s).
2009-05-22 14:59:52,129 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 3 time(s).
2009-05-22 14:59:53,130 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 4 time(s).
2009-05-22 14:59:54,131 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 5 time(s).
2009-05-22 14:59:55,132 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 6 time(s).
2009-05-22 14:59:56,132 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 7 time(s).
2009-05-22 14:59:57,133 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 8 time(s).
2009-05-22 14:59:58,134 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 9 time(s).
2009-05-22 14:59:58,135 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Exceeded max retries: 10
{code}

We don't get this forever when the master is restarted on the same node, because HRS.hbaseMaster still points at the same place. The problem is in this code:

{code}
public void process(WatchedEvent event) {
  EventType type = event.getType();
  KeeperState state = event.getState();
  LOG.info("Got ZooKeeper event, state: " + state + ", type: " + type + ", path: " + event.getPath());

  // Ignore events if we're shutting down.
  if (stopRequested.get()) {
    LOG.debug("Ignoring ZooKeeper event while shutting down");
    return;
  }

  if (state == KeeperState.Expired) {
    LOG.error("ZooKeeper session expired");
    restart();
  } else if (type == EventType.NodeCreated) {
    getMaster();
    // ZooKeeper watches are one time only, so we need to re-register our watch.
    watchMasterAddress();
  }
}
{code}

I see that the node is deleted, but I never see it being created, because we don't set a watch after a NodeDeleted event. We should, since otherwise we will never know when the master comes back. This should be changed. Instead, when the master node is deleted, we have to set a watch on the folder so that we see when the node is recreated.

> When a new master comes up, regionservers should continue with their region assignments from the last master
> ------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-1302
> URL: https://issues.apache.org/jira/browse/HBASE-1302
> Project: Hadoop HBase
> Issue Type: Improvement
> Components: master, regionserver
> Affects Versions: 0.20.0
> Reporter: Nitay Joffe
> Assignee: Jean-Daniel Cryans
> Fix For: 0.20.0
>
> Attachments: hbase-1302-v1.patch, hbase-1302-v2.patch
>
>
> After HBASE-1205, we can now handle a master going down and coming up somewhere else. When this happens, the new master will scan everything and reassign all the regions, which is not ideal.
> Instead of doing that, we should keep the region assignments from the last master.
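
To illustrate the one-time watch problem described in the comment, here is a minimal, self-contained sketch. It does not use the real ZooKeeper API or the actual HRegionServer code: ToyStore, ToyRegionServer, the "/hbase/master" path and the second master address (192.168.1.82:62000) are all made up for the example. It only models the behaviour that a watch is consumed when it fires, so a handler that ignores the deletion event never hears about the later creation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Toy model of one-shot watches. Hypothetical names; not the ZooKeeper API. */
class OneShotWatchSketch {

    enum EventType { NODE_CREATED, NODE_DELETED }

    /** In-memory store where each path holds at most one watch that fires once. */
    static class ToyStore {
        private final Map<String, String> nodes = new HashMap<>();
        private final Map<String, Consumer<EventType>> watches = new HashMap<>();

        void watch(String path, Consumer<EventType> watcher) {
            watches.put(path, watcher);
        }

        void create(String path, String data) {
            nodes.put(path, data);
            fire(path, EventType.NODE_CREATED);
        }

        void delete(String path) {
            nodes.remove(path);
            fire(path, EventType.NODE_DELETED);
        }

        String read(String path) {
            return nodes.get(path);
        }

        private void fire(String path, EventType type) {
            // One-shot: the watch is removed before the callback runs.
            Consumer<EventType> watcher = watches.remove(path);
            if (watcher != null) {
                watcher.accept(type);
            }
        }
    }

    /** Toy region server that tracks the master address through the store. */
    static class ToyRegionServer {
        static final String MASTER_PATH = "/hbase/master"; // assumed path
        private final ToyStore store;
        private final boolean rewatchOnDelete;
        String masterAddress;

        ToyRegionServer(ToyStore store, boolean rewatchOnDelete) {
            this.store = store;
            this.rewatchOnDelete = rewatchOnDelete;
            this.masterAddress = store.read(MASTER_PATH);
            store.watch(MASTER_PATH, this::process);
        }

        void process(EventType type) {
            if (type == EventType.NODE_CREATED) {
                masterAddress = store.read(MASTER_PATH);
                store.watch(MASTER_PATH, this::process); // re-register after creation
            } else if (type == EventType.NODE_DELETED && rewatchOnDelete) {
                // The change suggested in the comment: keep watching after a deletion.
                store.watch(MASTER_PATH, this::process);
            }
        }
    }

    /** Simulates a master moving to another node; returns the address the server ends up with. */
    static String addressAfterFailover(boolean rewatchOnDelete) {
        ToyStore store = new ToyStore();
        store.create(ToyRegionServer.MASTER_PATH, "192.168.1.81:62000");
        ToyRegionServer rs = new ToyRegionServer(store, rewatchOnDelete);
        store.delete(ToyRegionServer.MASTER_PATH);                       // old master goes away
        store.create(ToyRegionServer.MASTER_PATH, "192.168.1.82:62000"); // new master elsewhere
        return rs.masterAddress;
    }

    public static void main(String[] args) {
        System.out.println("without re-watch: " + addressAfterFailover(false));
        System.out.println("with re-watch:    " + addressAfterFailover(true));
    }
}
```

In this toy model, the server that does not re-register on deletion keeps the stale address and would keep retrying the old master, which matches the retry loop in the log. The variant that re-registers picks up the new address. Whether the real fix should watch the node itself or its parent folder is left open here.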