hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ramkrishna.s.vasudevan (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4212) TestMasterFailover fails occasionally
Date Thu, 29 Sep 2011 09:36:46 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117176#comment-13117176
] 

ramkrishna.s.vasudevan commented on HBASE-4212:
-----------------------------------------------

@Ted/@Gao

I think the issue exists in 0.90 branch and not in trunk.  Gao's analysis is valid in 0.90.
In 0.90
{code}
assignRootAndMeta();

    // Is this fresh start with no regions assigned or are we a master joining
    // an already-running cluster?  If regionsCount == 0, then for sure a
    // fresh start.  TOOD: Be fancier.  If regionsCount == 2, perhaps the
    // 2 are .META. and -ROOT- and we should fall into the fresh startup
    // branch below.  For now, do processFailover.
    if (regionCount == 0) {
      LOG.info("Master startup proceeding: cluster startup");
      this.assignmentManager.cleanoutUnassigned();
      this.assignmentManager.assignAllUserRegions();
    } else {
      LOG.info("Master startup proceeding: master failover");
      this.assignmentManager.processFailover();
    }
{code}
Inside AssignmentManager.processFailOver()
{code}
 HServerInfo hsi =
      this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
    regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi);
    hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
    regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi);
{code}
we first call regionOnline inside which the region in RIT is getting removed.
So even before the call back happens for OPENED state it is getting removed.  Hence ROOT node
remains in RIT and testcase timesout.
In trunk we dont have this problem as in Assignment.joinCluster() we dont have this code of
removing from regionOnline.
{code}
    // Scan META to build list of existing regions, servers, and assignment
    // Returns servers who have not checked in (assumed dead) and their regions
    Map<ServerName,List<Pair<HRegionInfo,Result>>> deadServers =
      rebuildUserRegions();
{code}
I will now verify this testcase with the patch with the script in HBASE-4480.  
Gao and Ted what do you suggest?


                
> TestMasterFailover fails occasionally
> -------------------------------------
>
>                 Key: HBASE-4212
>                 URL: https://issues.apache.org/jira/browse/HBASE-4212
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.4
>            Reporter: gaojinchao
>            Assignee: gaojinchao
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4212_TrunkV1.patch, HBASE-4212_branch90V1.patch
>
>
> It seems a bug. The root in RIT can't be moved..
> In the failover process, it enforces root on-line. But not clean zk node. 
> test will wait forever.
>   void processFailover() throws KeeperException, IOException, InterruptedException {
>      
>     // we enforce on-line root.
>     HServerInfo hsi =
>       this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
>     regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi);
>     hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
>     regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi);
> It seems that we should wait finished as meta region 
>   int assignRootAndMeta()
>   throws InterruptedException, IOException, KeeperException {
>     int assigned = 0;
>     long timeout = this.conf.getLong("hbase.catalog.verification.timeout", 1000);
>     // Work on ROOT region.  Is it in zk in transition?
>     boolean rit = this.assignmentManager.
>       processRegionInTransitionAndBlockUntilAssigned(HRegionInfo.ROOT_REGIONINFO);
>     if (!catalogTracker.verifyRootRegionLocation(timeout)) {
>       this.assignmentManager.assignRoot();
>       this.catalogTracker.waitForRoot();
>       //we need add this code and guarantee that the transition has completed
>       this.assignmentManager.waitForAssignment(HRegionInfo.ROOT_REGIONINFO);
>       assigned++;
>     }
> logs:
> 2011-08-16 07:45:40,715 DEBUG [RegionServer:0;C4S2.site,47710,1313495126115-EventThread]
zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2690f780004 Received ZooKeeper Event,
type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/70236052
> 2011-08-16 07:45:40,715 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] zookeeper.ZKAssign(712):
regionserver:47710-0x131d2690f780004 Successfully transitioned node 70236052 from RS_ZK_REGION_OPENING
to RS_ZK_REGION_OPENING
> 2011-08-16 07:45:40,715 DEBUG [Thread-760-EventThread] zookeeper.ZooKeeperWatcher(252):
master:60701-0x131d2690f780009 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected,
path=/hbase/unassigned/70236052
> 2011-08-16 07:45:40,716 INFO  [PostOpenDeployTasks:70236052] catalog.RootLocationEditor(62):
Setting ROOT region location in ZooKeeper as C4S2.site:47710
> 2011-08-16 07:45:40,716 DEBUG [Thread-760-EventThread] zookeeper.ZKUtil(1109): master:60701-0x131d2690f780009
Retrieved 52 byte(s) of data from znode /hbase/unassigned/70236052 and set watcher; region=-ROOT-,,0,
server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENING
> 2011-08-16 07:45:40,717 DEBUG [Thread-760-EventThread] master.AssignmentManager(477):
Handling transition=RS_ZK_REGION_OPENING, server=C4S2.site,47710,1313495126115, region=70236052/-ROOT-
> 2011-08-16 07:45:40,725 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] zookeeper.ZKAssign(661):
regionserver:47710-0x131d2690f780004 Attempting to transition node 70236052/-ROOT- from RS_ZK_REGION_OPENING
to RS_ZK_REGION_OPENED
> 2011-08-16 07:45:40,727 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] zookeeper.ZKUtil(1109):
regionserver:47710-0x131d2690f780004 Retrieved 52 byte(s) of data from znode /hbase/unassigned/70236052;
data=region=-ROOT-,,0, server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENING
> 2011-08-16 07:45:40,740 DEBUG [RegionServer:0;C4S2.site,47710,1313495126115-EventThread]
zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2690f780004 Received ZooKeeper Event,
type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/70236052
> 2011-08-16 07:45:40,740 DEBUG [Thread-760-EventThread] zookeeper.ZooKeeperWatcher(252):
master:60701-0x131d2690f780009 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected,
path=/hbase/unassigned/70236052
> 2011-08-16 07:45:40,740 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] zookeeper.ZKAssign(712):
regionserver:47710-0x131d2690f780004 Successfully transitioned node 70236052 from RS_ZK_REGION_OPENING
to RS_ZK_REGION_OPENED
> 2011-08-16 07:45:40,741 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] handler.OpenRegionHandler(121):
Opened -ROOT-,,0.70236052
> 2011-08-16 07:45:40,741 DEBUG [Thread-760-EventThread] zookeeper.ZKUtil(1109): master:60701-0x131d2690f780009
Retrieved 52 byte(s) of data from znode /hbase/unassigned/70236052 and set watcher; region=-ROOT-,,0,
server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENED
> 2011-08-16 07:45:40,741 DEBUG [Thread-760-EventThread] master.AssignmentManager(477):
Handling transition=RS_ZK_REGION_OPENED, server=C4S2.site,47710,1313495126115, region=70236052/-ROOT-
> //.............................................It said that zk node can't be cleaned
because of we have enforced on-line the root.......................................
> // The test will wait forever.
> 2011-08-16 07:45:40,741 WARN  [Thread-760-EventThread] master.AssignmentManager(540):
Received OPENED for region 70236052/-ROOT- from server C4S2.site,47710,1313495126115 but region
was in  the state null and not in expected PENDING_OPEN or OPENING states
> 2011-08-16 07:45:41,018 DEBUG [Master:0;C4S2.site:60701] zookeeper.ZKUtil(1109): master:60701-0x131d2690f780009
Retrieved 52 byte(s) of data from znode /hbase/unassigned/70236052 and set watcher; region=-ROOT-,,0,
server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENED
> 2011-08-16 07:45:41,233 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RIT -> 70236052
> 2011-08-16 07:45:41,337 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RIT -> 70236052
> 2011-08-16 07:45:41,439 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RIT -> 70236052
> 2011-08-16 07:45:41,543 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RIT -> 70236052
> 2011-08-16 07:45:41,645 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RIT -> 70236052
> 2011-08-16 07:45:41,748 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RIT -> 70236052
> 2011-08-16 07:45:41,900 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RIT -> 70236052
> 2011-08-16 07:45:42,002 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RIT -> 70236052
> 2011-08-16 07:45:42,105 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RIT -> 70236052
> 2011-08-16 07:45:42,206 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RIT -> 70236052
> 2011-08-16 07:45:42,308 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RIT -> 70236052
> 2011-08-16 07:45:42,410 DEBUG [Thread-760] zookeeper.ZKAssign(807): ZK RIT -> 70236052

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message