Return-Path: X-Original-To: apmail-hbase-issues-archive@www.apache.org Delivered-To: apmail-hbase-issues-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id C098A8A7C for ; Fri, 9 Sep 2011 13:58:32 +0000 (UTC) Received: (qmail 9047 invoked by uid 500); 9 Sep 2011 13:58:32 -0000 Delivered-To: apmail-hbase-issues-archive@hbase.apache.org Received: (qmail 8976 invoked by uid 500); 9 Sep 2011 13:58:31 -0000 Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list issues@hbase.apache.org Received: (qmail 8967 invoked by uid 99); 9 Sep 2011 13:58:31 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 09 Sep 2011 13:58:31 +0000 X-ASF-Spam-Status: No, hits=-2000.5 required=5.0 tests=ALL_TRUSTED,RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 09 Sep 2011 13:58:29 +0000 Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116]) by hel.zones.apache.org (Postfix) with ESMTP id E071C8A67E for ; Fri, 9 Sep 2011 13:58:08 +0000 (UTC) Date: Fri, 9 Sep 2011 13:58:08 +0000 (UTC) From: "gaojinchao (JIRA)" To: issues@hbase.apache.org Message-ID: <1136268287.8023.1315576688916.JavaMail.tomcat@hel.zones.apache.org> In-Reply-To: <1902924984.44532.1313569587201.JavaMail.tomcat@hel.zones.apache.org> Subject: [jira] [Commented] (HBASE-4212) TestMasterFailover fails occasionally MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HBASE-4212?page=3Dcom.atlassian= .jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D1310= 1213#comment-13101213 ]=20 gaojinchao commented on HBASE-4212: ----------------------------------- @Stack=EF=BC=8C Thanks for your review.=20 In our environment, it often fails, so we skip this case(for my case is tha= t all test cases are performed automatically every day).=20 The step for opening a root region: step A: Master tells Region server to open root region. step B: Region server opens root region and sets zk node(rootServerZNodezk)= . It is finished means that catalogtracker can works. step C: Region server updates the zk node(assignmentZNode) tells master tha= t root has opened(some cases may fail, but we have told the root could be u= sed). step D: Master deletes the zk node (assignmentZNode) and adds root region t= o online set. In my case, master skipped the step D because delayed. master forced root r= egion online in processFailover. So zk node couldn't be deleted and failove= r case failed. finishInitialization code=EF=BC=9A // Make sure root and meta assigned before proceeding. assignRootAndMeta(); // Is this fresh start with no regions assigned or are we a master join= ing // an already-running cluster? If regionsCount =3D=3D 0, then for sure= a // fresh start. TOOD: Be fancier. If regionsCount =3D=3D 2, perhaps t= he // 2 are .META. and -ROOT- and we should fall into the fresh startup // branch below. For now, do processFailover. if (regionCount =3D=3D 0) { LOG.info("Master startup proceeding: cluster startup"); this.assignmentManager.cleanoutUnassigned(); this.assignmentManager.assignAllUserRegions(); } else { LOG.info("Master startup proceeding: master failover"); this.assignmentManager.processFailover(); } processFailover code: HServerInfo hsi =3D this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation= ()); regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi); hsi =3D this.serverManager.getHServerInfo(this.catalogTracker.getRootLo= cation()); regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi); > TestMasterFailover fails occasionally > ------------------------------------- > > Key: HBASE-4212 > URL: https://issues.apache.org/jira/browse/HBASE-4212 > Project: HBase > Issue Type: Bug > Components: master > Affects Versions: 0.90.4 > Reporter: gaojinchao > Assignee: gaojinchao > Fix For: 0.90.5 > > Attachments: HBASE-4212_TrunkV1.patch, HBASE-4212_branch90V1.patc= h > > > It seems a bug. The root in RIT can't be moved.. > In the failover process, it enforces root on-line. But not clean zk node.= =20 > test will wait forever. > void processFailover() throws KeeperException, IOException, Interrupted= Exception { > =20 > // we enforce on-line root. > HServerInfo hsi =3D > this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocati= on()); > regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi); > hsi =3D this.serverManager.getHServerInfo(this.catalogTracker.getRoot= Location()); > regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi); > It seems that we should wait finished as meta region=20 > int assignRootAndMeta() > throws InterruptedException, IOException, KeeperException { > int assigned =3D 0; > long timeout =3D this.conf.getLong("hbase.catalog.verification.timeou= t", 1000); > // Work on ROOT region. Is it in zk in transition? > boolean rit =3D this.assignmentManager. > processRegionInTransitionAndBlockUntilAssigned(HRegionInfo.ROOT_REG= IONINFO); > if (!catalogTracker.verifyRootRegionLocation(timeout)) { > this.assignmentManager.assignRoot(); > this.catalogTracker.waitForRoot(); > //we need add this code and guarantee that the transition has compl= eted > this.assignmentManager.waitForAssignment(HRegionInfo.ROOT_REGIONINF= O); > assigned++; > } > logs: > 2011-08-16 07:45:40,715 DEBUG [RegionServer:0;C4S2.site,47710,13134951261= 15-EventThread] zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2= 690f780004 Received ZooKeeper Event, type=3DNodeDataChanged, state=3DSyncCo= nnected, path=3D/hbase/unassigned/70236052 > 2011-08-16 07:45:40,715 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115= -0] zookeeper.ZKAssign(712): regionserver:47710-0x131d2690f780004 Successfu= lly transitioned node 70236052 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OP= ENING > 2011-08-16 07:45:40,715 DEBUG [Thread-760-EventThread] zookeeper.ZooKeepe= rWatcher(252): master:60701-0x131d2690f780009 Received ZooKeeper Event, typ= e=3DNodeDataChanged, state=3DSyncConnected, path=3D/hbase/unassigned/702360= 52 > 2011-08-16 07:45:40,716 INFO [PostOpenDeployTasks:70236052] catalog.Root= LocationEditor(62): Setting ROOT region location in ZooKeeper as C4S2.site:= 47710 > 2011-08-16 07:45:40,716 DEBUG [Thread-760-EventThread] zookeeper.ZKUtil(1= 109): master:60701-0x131d2690f780009 Retrieved 52 byte(s) of data from znod= e /hbase/unassigned/70236052 and set watcher; region=3D-ROOT-,,0, server=3D= C4S2.site,47710,1313495126115, state=3DRS_ZK_REGION_OPENING > 2011-08-16 07:45:40,717 DEBUG [Thread-760-EventThread] master.AssignmentM= anager(477): Handling transition=3DRS_ZK_REGION_OPENING, server=3DC4S2.site= ,47710,1313495126115, region=3D70236052/-ROOT- > 2011-08-16 07:45:40,725 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115= -0] zookeeper.ZKAssign(661): regionserver:47710-0x131d2690f780004 Attemptin= g to transition node 70236052/-ROOT- from RS_ZK_REGION_OPENING to RS_ZK_REG= ION_OPENED > 2011-08-16 07:45:40,727 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115= -0] zookeeper.ZKUtil(1109): regionserver:47710-0x131d2690f780004 Retrieved = 52 byte(s) of data from znode /hbase/unassigned/70236052; data=3Dregion=3D-= ROOT-,,0, server=3DC4S2.site,47710,1313495126115, state=3DRS_ZK_REGION_OPEN= ING > 2011-08-16 07:45:40,740 DEBUG [RegionServer:0;C4S2.site,47710,13134951261= 15-EventThread] zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2= 690f780004 Received ZooKeeper Event, type=3DNodeDataChanged, state=3DSyncCo= nnected, path=3D/hbase/unassigned/70236052 > 2011-08-16 07:45:40,740 DEBUG [Thread-760-EventThread] zookeeper.ZooKeepe= rWatcher(252): master:60701-0x131d2690f780009 Received ZooKeeper Event, typ= e=3DNodeDataChanged, state=3DSyncConnected, path=3D/hbase/unassigned/702360= 52 > 2011-08-16 07:45:40,740 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115= -0] zookeeper.ZKAssign(712): regionserver:47710-0x131d2690f780004 Successfu= lly transitioned node 70236052 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OP= ENED > 2011-08-16 07:45:40,741 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115= -0] handler.OpenRegionHandler(121): Opened -ROOT-,,0.70236052 > 2011-08-16 07:45:40,741 DEBUG [Thread-760-EventThread] zookeeper.ZKUtil(1= 109): master:60701-0x131d2690f780009 Retrieved 52 byte(s) of data from znod= e /hbase/unassigned/70236052 and set watcher; region=3D-ROOT-,,0, server=3D= C4S2.site,47710,1313495126115, state=3DRS_ZK_REGION_OPENED > 2011-08-16 07:45:40,741 DEBUG [Thread-760-EventThread] master.AssignmentM= anager(477): Handling transition=3DRS_ZK_REGION_OPENED, server=3DC4S2.site,= 47710,1313495126115, region=3D70236052/-ROOT- > //.............................................It said that zk node can't= be cleaned because of we have enforced on-line the root...................= .................... > // The test will wait forever. > 2011-08-16 07:45:40,741 WARN [Thread-760-EventThread] master.AssignmentM= anager(540): Received OPENED for region 70236052/-ROOT- from server C4S2.si= te,47710,1313495126115 but region was in the state null and not in expecte= d PENDING_OPEN or OPENING states > 2011-08-16 07:45:41,018 DEBUG [Master:0;C4S2.site:60701] zookeeper.ZKUtil= (1109): master:60701-0x131d2690f780009 Retrieved 52 byte(s) of data from zn= ode /hbase/unassigned/70236052 and set watcher; region=3D-ROOT-,,0, server= =3DC4S2.site,47710,1313495126115, state=3DRS_ZK_REGION_OPENED > 2011-08-16 07:45:41,233 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RI= T -> 70236052 > 2011-08-16 07:45:41,337 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RI= T -> 70236052 > 2011-08-16 07:45:41,439 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RI= T -> 70236052 > 2011-08-16 07:45:41,543 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RI= T -> 70236052 > 2011-08-16 07:45:41,645 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RI= T -> 70236052 > 2011-08-16 07:45:41,748 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RI= T -> 70236052 > 2011-08-16 07:45:41,900 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RI= T -> 70236052 > 2011-08-16 07:45:42,002 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RI= T -> 70236052 > 2011-08-16 07:45:42,105 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RI= T -> 70236052 > 2011-08-16 07:45:42,206 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RI= T -> 70236052 > 2011-08-16 07:45:42,308 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RI= T -> 70236052 > 2011-08-16 07:45:42,410 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RI= T -> 70236052 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira