hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9721) RegionServer should not accept regionOpen RPC intended for another(previous) server
Date Tue, 14 Jan 2014 00:15:53 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870162#comment-13870162
] 

Enis Soztutar commented on HBASE-9721:
--------------------------------------

bq. What should we do here if we closeRegion?
I guess the AM will fail to close the region, but meanwhile the race condition will be gone,
and next try we will pick up the correct server, no? 

> RegionServer should not accept regionOpen RPC intended for another(previous) server
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-9721
>                 URL: https://issues.apache.org/jira/browse/HBASE-9721
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.0, 0.96.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.0, 0.99.0
>
>         Attachments: hbase-9721_v0.patch, hbase-9721_v1.patch, hbase-9721_v2.patch, hbase-9721_v3.patch
>
>
> On a test cluster, this following events happened with ITBLL and CM leading to meta being
unavailable until master is restarted. 
> An RS carrying meta died, and master assigned the region to one of the RSs. 
> {code}
> 2013-10-03 23:30:06,611 INFO  [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1]
master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:06,611 INFO  [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1]
master.RegionStates: Transitioned {1588230740 state=OFFLINE, ts=1380843006601, server=null}
to {1588230740 state=PENDING_OPEN, ts=1380843006611, server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
> 2013-10-03 23:30:06,611 DEBUG [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1]
master.ServerManager: New admin connection to gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> {code}
> At the same time, the RS that meta recently got assigned also died (due to CM), and restarted:

> {code}
> 2013-10-03 23:30:07,636 DEBUG [RpcServer.handler=17,port=60000] master.ServerManager:
REPORT: Server gs-hdp2-secure-1380781860-hbase-8.cs1cloud.internal,60020,1380843002494 came
back up, removed it from the dead servers list
> 2013-10-03 23:30:08,769 INFO  [RpcServer.handler=18,port=60000] master.ServerManager:
Triggering server recovery; existingServer gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
looks stale, new server:gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362
> 2013-10-03 23:30:08,771 DEBUG [RpcServer.handler=18,port=60000] master.AssignmentManager:
Checking region=hbase:meta,,1.1588230740, zk server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
current=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820, matches=true
> 2013-10-03 23:30:08,771 DEBUG [RpcServer.handler=18,port=60000] master.ServerManager:
Added=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 to dead servers,
submitted shutdown handler to be executed meta=true
> 2013-10-03 23:30:08,771 INFO  [RpcServer.handler=18,port=60000] master.ServerManager:
Registering server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362
> 2013-10-03 23:30:08,772 INFO  [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
handler.MetaServerShutdownHandler: Splitting hbase:meta logs for gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> {code}
> AM/SSH sees that the RS that died was carrying meta, but the assignment RPC request was
still not sent:
> {code}
> 2013-10-03 23:30:08,791 DEBUG [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
master.AssignmentManager: Checking region=hbase:meta,,1.1588230740, zk server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
current=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820, matches=true
> 2013-10-03 23:30:08,791 INFO  [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
handler.MetaServerShutdownHandler: Server gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
was carrying META. Trying to assign.
> 2013-10-03 23:30:08,791 DEBUG [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
master.RegionStates: Offline 1588230740 with current state=PENDING_OPEN, expected state=OFFLINE/SPLITTING/MERGING
> 2013-10-03 23:30:08,791 INFO  [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
master.RegionStates: Transitioned {1588230740 state=PENDING_OPEN, ts=1380843006611, server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
to {1588230740 state=OFFLINE, ts=1380843008791, server=null}
> 2013-10-03 23:30:09,809 INFO  [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
zookeeper.ZooKeeperNodeTracker: Unsetting hbase:meta region location in ZooKeeper
> {code}
> Our first attempt at the assign rpc fails, because the new server is now starting. The
second attempt though succeeds: 
> {code}
> 2013-10-03 23:30:10,621 INFO  [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1]
master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:10,621 INFO  [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1]
master.RegionStates: Transitioned {1588230740 state=OFFLINE, ts=1380843008791, server=null}
to {1588230740 state=PENDING_OPEN, ts=1380843010621, server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
> 2013-10-03 23:30:10,621 DEBUG [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1]
master.ServerManager: New admin connection to gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:10,622 DEBUG [RpcServer.handler=22,port=60000] master.ServerManager:
REPORT: Server gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 came
back up, removed it from the dead servers list
> 2013-10-03 23:30:10,934 DEBUG [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
master.AssignmentManager: Skip assigning {ENCODED => 1588230740, NAME => 'hbase:meta,,1',
STARTKEY => '', ENDKEY => ''}, it is already {1588230740 state=PENDING_OPEN, ts=1380843010621,
server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
> {code}
> Note that the start time for the server does not match (1380842900820 is old , 1380843006362
is new). 
> The region server got the rpc to open the region, but failed to change the zk state,
because ServerNames is not matching: 
> {code}
> 2013-10-03 23:30:10,601 INFO  [Priority.RpcServer.handler=0,port=60020] regionserver.HRegionServer:
Open hbase:meta,,1.1588230740
> 2013-10-03 23:30:10,897 DEBUG [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0]
zookeeper.ZKAssign: regionserver:60020-0x1417d489d9b0bd6 Transitioning 1588230740 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_OPENING
> 2013-10-03 23:30:10,918 WARN  [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0]
zookeeper.ZKAssign: regionserver:60020-0x1417d489d9b0bd6 Attempt to transition the unassigned
node for 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that
tried to transition was gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362
not the expected gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:10,918 WARN  [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0]
handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=1588230740
> 2013-10-03 23:30:10,919 WARN  [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0]
handler.OpenRegionHandler: Region was hijacked? Opening cancelled for encodedName=1588230740
> 2013-10-03 23:30:10,919 INFO  [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0]
handler.OpenRegionHandler: Opening of region {ENCODED => 1588230740, NAME => 'hbase:meta,,1',
STARTKEY => '', ENDKEY => ''} failed, transitioning from OFFLINE to FAILED_OPEN in ZK,
expecting version 0
> 2013-10-03 23:30:10,919 DEBUG [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0]
zookeeper.ZKAssign: regionserver:60020-0x1417d489d9b0bd6 Transitioning 1588230740 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_FAILED_OPEN
> 2013-10-03 23:30:10,921 WARN  [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0]
zookeeper.ZKAssign: regionserver:60020-0x1417d489d9b0bd6 Attempt to transition the unassigned
node for 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_FAILED_OPEN failed, the server
that tried to transition was gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362
not the expected gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:10,921 WARN  [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0]
handler.OpenRegionHandler: Unable to mark region {ENCODED => 1588230740, NAME => 'hbase:meta,,1',
STARTKEY => '', ENDKEY => ''} as FAILED_OPEN. It's likely that the master already timed
out this open attempt, and thus another RS already has the region.
> {code}
> It seems that the RS behaved correct by not being able to open the region by transitioning
the zk assignment node. However, the master fails to timeout the assignment even though the
meta region is reported in RIT: 
> {code}
> 2013-10-04 00:14:50,658 DEBUG [gs-hdp2-secure-1380781860-hbase-12.cs1cloud.internal,60000,1380842679864-BalancerChore]
master.HMaster: Not running balancer because 1 region(s) in transition: {1588230740={1588230740
state=PENDING_OPEN, ts=1380843010621, server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}}
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Mime
View raw message