Date: Wed, 12 Aug 2015 23:17:45 +0000 (UTC)
From: "Andrew Purtell (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-14207) Region was hijacked and remained in transition when RS failed to open a region and later regionplan changed to new RS on retry

[ https://issues.apache.org/jira/browse/HBASE-14207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-14207:
-----------------------------------
    Fix Version/s:     (was: 0.98.14)
           Status: Open  (was: Patch Available)

bq. org.apache.hadoop.hbase.master.TestZKLessAMOnCluster

This looks like a relevant test failure.

> Region was hijacked and remained in transition when RS failed to open a region and later regionplan changed to new RS on retry
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14207
>                 URL: https://issues.apache.org/jira/browse/HBASE-14207
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.98.6
>            Reporter: Pankaj Kumar
>            Assignee: Pankaj Kumar
>            Priority: Critical
>             Fix For: 0.98.15
>
>         Attachments: HBASE-14207-0.98.patch
>
>
> In a production environment, the following events happened:
> 1. The master tried to assign a region to an RS, but the RS failed to open the region because of a KeeperException$SessionExpiredException.
>    The RS log showed multiple WARN entries related to KeeperException$SessionExpiredException:
>    > KeeperErrorCode = Session expired for /hbase/region-in-transition/08f1935d652e5dbdac09b423b8f9401b
>    > Unable to get data of znode /hbase/region-in-transition/08f1935d652e5dbdac09b423b8f9401b
> 2. The master retried the assignment to the same RS, but the RS failed again.
> 3. On the second retry a new plan was formed with a different destination RS, so the master sent the open request to the new RS. The new RS also failed to open the region, because the server name recorded in the znode did not match the expected current server name.
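The server mismatch in step 3 can be illustrated with a minimal sketch. It is a simplified, hypothetical model and not the actual HBase `ZKAssign` code: the class `ZnodeOwnershipSketch`, the `Znode` type, and the `transition` method are invented for illustration, and the assumption (based only on the WARN messages in the logs) is that a transition is refused when the server recorded in the znode differs from the server attempting the transition.

```java
import java.util.Objects;

// Hypothetical, simplified model of the ownership check; NOT the real HBase ZKAssign code.
class ZnodeOwnershipSketch {

    /** Minimal stand-in for the data kept in a region-in-transition znode. */
    static final class Znode {
        String eventType;   // e.g. "M_ZK_REGION_OFFLINE"
        String serverName;  // server recorded when the node was last written

        Znode(String eventType, String serverName) {
            this.eventType = eventType;
            this.serverName = serverName;
        }
    }

    /**
     * Attempts a state transition on behalf of {@code caller}.
     * Assumption: the transition is refused unless both the current state
     * and the recorded server name match what the caller expects.
     */
    static boolean transition(Znode node, String caller, String expectedState, String newState) {
        if (!Objects.equals(node.eventType, expectedState)) {
            return false; // node is not in the state the caller expects
        }
        if (!Objects.equals(node.serverName, caller)) {
            return false; // node is still attributed to a different server
        }
        node.eventType = newState;
        return true;
    }

    public static void main(String[] args) {
        String rs13 = "T101PC03VM13,21302,1436816690692";
        String rs14 = "T101PC03VM14,21302,1436816997967";

        // The znode was written for the first plan (destination rs13) and is
        // assumed not to have been updated when the plan changed to rs14.
        Znode node = new Znode("M_ZK_REGION_OFFLINE", rs13);

        // rs14 can neither start opening nor report the failure.
        boolean opening = transition(node, rs14, "M_ZK_REGION_OFFLINE", "RS_ZK_REGION_OPENING");
        boolean failedOpen = transition(node, rs14, "M_ZK_REGION_OFFLINE", "RS_ZK_REGION_FAILED_OPEN");

        System.out.println("opening=" + opening + ", failedOpen=" + failedOpen
                + ", state=" + node.eventType);
        // prints: opening=false, failedOpen=false, state=M_ZK_REGION_OFFLINE
    }
}
```

In this model neither transition succeeds, so the node never leaves `M_ZK_REGION_OFFLINE`. That is consistent with the region staying in transition in the report, although the sketch says nothing about how the attached patch addresses it.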
> Logs Snippet:
> {noformat}
> HM
> 2015-07-14 03:50:29,759 | INFO | master:T101PC03VM13:21300 | Processing 08f1935d652e5dbdac09b423b8f9401b in state: M_ZK_REGION_OFFLINE | org.apache.hadoop.hbase.master.AssignmentManager.processRegionsInTransition(AssignmentManager.java:644)
> 2015-07-14 03:50:29,759 | INFO | master:T101PC03VM13:21300 | Transitioned {08f1935d652e5dbdac09b423b8f9401b state=OFFLINE, ts=1436817029679, server=null} to {08f1935d652e5dbdac09b423b8f9401b state=PENDING_OPEN, ts=1436817029759, server=T101PC03VM13,21302,1436816690692} | org.apache.hadoop.hbase.master.RegionStates.updateRegionState(RegionStates.java:327)
> 2015-07-14 03:50:29,760 | INFO | master:T101PC03VM13:21300 | Processed region 08f1935d652e5dbdac09b423b8f9401b in state M_ZK_REGION_OFFLINE, on server: T101PC03VM13,21302,1436816690692 | org.apache.hadoop.hbase.master.AssignmentManager.processRegionsInTransition(AssignmentManager.java:768)
> 2015-07-14 03:50:29,800 | INFO | MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Assigning INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. to T101PC03VM13,21302,1436816690692 | org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1983)
> 2015-07-14 03:50:29,801 | WARN | MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Failed assignment of INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. to T101PC03VM13,21302,1436816690692, trying to assign elsewhere instead; try=1 of 10 | org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2077)
> 2015-07-14 03:50:29,802 | INFO | MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Trying to re-assign INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. to the same failed server. | org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2123)
> 2015-07-14 03:50:31,804 | INFO | MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Assigning INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. to T101PC03VM13,21302,1436816690692 | org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1983)
> 2015-07-14 03:50:31,806 | WARN | MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Failed assignment of INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. to T101PC03VM13,21302,1436816690692, trying to assign elsewhere instead; try=2 of 10 | org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2077)
> 2015-07-14 03:50:31,807 | INFO | MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Transitioned {08f1935d652e5dbdac09b423b8f9401b state=PENDING_OPEN, ts=1436817031804, server=T101PC03VM13,21302,1436816690692} to {08f1935d652e5dbdac09b423b8f9401b state=OFFLINE, ts=1436817031807, server=T101PC03VM13,21302,1436816690692} | org.apache.hadoop.hbase.master.RegionStates.updateRegionState(RegionStates.java:327)
> 2015-07-14 03:50:31,807 | INFO | MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Assigning INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. to T101PC03VM14,21302,1436816997967 | org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1983)
> 2015-07-14 03:50:31,807 | INFO | MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Transitioned {08f1935d652e5dbdac09b423b8f9401b state=OFFLINE, ts=1436817031807, server=T101PC03VM13,21302,1436816690692} to {08f1935d652e5dbdac09b423b8f9401b state=PENDING_OPEN, ts=1436817031807, server=T101PC03VM14,21302,1436816997967} | org.apache.hadoop.hbase.master.RegionStates.updateRegionState(RegionStates.java:327)
> 2015-07-14 03:51:09,501 | INFO | MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-4 | Skip assigning region in transition on other server{08f1935d652e5dbdac09b423b8f9401b state=PENDING_OPEN, ts=1436817031807, server=T101PC03VM14,21302,1436816997967} | org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:250)
> {noformat}
> {noformat}
> RS - T101PC03VM14
> 2015-07-14 03:50:31,809 | INFO | PriorityRpcServer.handler=2,queue=0,port=21302 | Open INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. | org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:3671)
> 2015-07-14 03:50:31,830 | WARN | RS_OPEN_REGION-T101PC03VM14:21302-2 | regionserver:21302-0xe4e88f6f1b70002, quorum=t101pc03vm12:24002,t101pc03vm13:24002,t101pc03vm14:24002, baseZNode=/hbase Attempt to transition the unassigned node for 08f1935d652e5dbdac09b423b8f9401b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to transition was T101PC03VM14,21302,1436816997967 not the expected T101PC03VM13,21302,1436816690692 | org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:875)
> 2015-07-14 03:50:31,830 | WARN | RS_OPEN_REGION-T101PC03VM14:21302-2 | Failed transition from OFFLINE to OPENING for region=08f1935d652e5dbdac09b423b8f9401b | org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.transitionZookeeperOfflineToOpening(OpenRegionHandler.java:539)
> 2015-07-14 03:50:31,831 | WARN | RS_OPEN_REGION-T101PC03VM14:21302-2 | Region was hijacked? Opening cancelled for encodedName=08f1935d652e5dbdac09b423b8f9401b | org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:132)
> 2015-07-14 03:50:31,831 | INFO | RS_OPEN_REGION-T101PC03VM14:21302-2 | Opening of region {ENCODED => 08f1935d652e5dbdac09b423b8f9401b, NAME => 'INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b.', STARTKEY => '', ENDKEY => '200'} failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting version -1 | org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tryTransitionFromOfflineToFailedOpen(OpenRegionHandler.java:436)
> 2015-07-14 03:50:31,834 | WARN | RS_OPEN_REGION-T101PC03VM14:21302-2 | regionserver:21302-0xe4e88f6f1b70002, quorum=t101pc03vm12:24002,t101pc03vm13:24002,t101pc03vm14:24002, baseZNode=/hbase Attempt to transition the unassigned node for 08f1935d652e5dbdac09b423b8f9401b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to transition was T101PC03VM14,21302,1436816997967 not the expected T101PC03VM13,21302,1436816690692 | org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:875)
> 2015-07-14 03:50:31,834 | WARN | RS_OPEN_REGION-T101PC03VM14:21302-2 | Unable to mark region {ENCODED => 08f1935d652e5dbdac09b423b8f9401b, NAME => 'INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b.', STARTKEY => '', ENDKEY => '200'} as FAILED_OPEN. It's likely that the master already timed out this open attempt, and thus another RS already has the region. | org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tryTransitionFromOfflineToFailedOpen(OpenRegionHandler.java:444)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)