hbase-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-9777) Two consecutive RS crashes could lead to their SSH stepping on each other's toes and cause master abort
Date Wed, 16 Oct 2013 00:57:42 GMT
Devaraj Das created HBASE-9777:
----------------------------------

             Summary: Two consecutive RS crashes could lead to their SSH stepping on each other's toes and cause master abort
                 Key: HBASE-9777
                 URL: https://issues.apache.org/jira/browse/HBASE-9777
             Project: HBase
          Issue Type: Bug
            Reporter: Devaraj Das


Here is the sequence of events:
1. Master assigns regions to some server RS1. One particular region is 300d71b112325d43b99b6148ec7bc5b3
2. RS1 crashes
3. Master tries to bulk-reassign RS1's regions to other RSs (the bulk assignment has retries as well). Let's say one of them is RS2.
{noformat}
2013-10-14 21:16:22,218 INFO  [hor13n02.gq1.ygridcore.net,60000,1381784464025-GeneralBulkAssigner-0]
master.RegionStates: Transitioned {300d71b112325d43b99b6148ec7bc5b3 state=OFFLINE, ts=1381785382125,
server=null} to {300d71b112325d43b99b6148ec7bc5b3 state=PENDING_OPEN, ts=1381785382218, server=hor13n04.gq1.ygridcore.net,60020,1381784772417}
{noformat}
4. RS2 crashes
5. The ServerShutdownHandler for RS2 is executed, and it tries to reassign the regions.
{noformat}
2013-10-14 21:16:32,185 INFO  [MASTER_SERVER_OPERATIONS-hor13n02:60000-3] master.RegionStates:
Found opening region {300d71b112325d43b99b6148ec7bc5b3 state=PENDING_OPEN, ts=1381785382218,
server=hor13n04.gq1.ygridcore.net,60020,1381784772417} to be reassigned by SSH for hor13n04.gq1.ygridcore.net,60020,1381784772417
{noformat}
6. The reassignment in step 5 succeeds, and the region states become OPEN.
7. The retry from step 3 kicks in:
{noformat}
2013-10-14 21:16:22,222 INFO  [MASTER_SERVER_OPERATIONS-hor13n02:60000-1] master.GeneralBulkAssigner:
Failed assigning 52 regions to server hor13n04.gq1.ygridcore.net,60020,1381784772417, reassigning
them
{noformat}
8. The retry finds a region already in state OPEN, cannot transition it to OFFLINE, and the master aborts with this stack trace:
{noformat}
2013-10-14 21:16:34,342 FATAL AM.-pool1-t46 master.HMaster: Unexpected state :
{300d71b112325d43b99b6148ec7bc5b3 state=OPEN, ts=1381785392864, server=hor13n08.gq1.ygridcore.net,60020,1381785385596}
.. Cannot transit it to OFFLINE.
java.lang.IllegalStateException: Unexpected state : {300d71b112325d43b99b6148ec7bc5b3 state=OPEN,
ts=1381785392864, server=hor13n08.gq1.ygridcore.net,60020,1381785385596}
.. Cannot transit it to OFFLINE.
at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:2074)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1855)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1449)
at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:45)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
{noformat}
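
The sequence above can be reproduced in miniature. The following is only a sketch, not HBase code and not a proposed patch: the class StaleRetrySketch, its RegionState type, and the methods forceOffline, reassignBySsh and guardedRetry are all hypothetical names made up for illustration, and the server names RS2, RS3 and RS4 are placeholders. It shows why a stale bulk-assign retry trips the IllegalStateException once the SSH has already moved the region, and one possible guard (an assumption, not necessarily what the real fix should be): the retry skips any region that is no longer PENDING_OPEN on the server whose assignment failed.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleRetrySketch {
    enum State { OFFLINE, PENDING_OPEN, OPEN }

    /** Hypothetical stand-in for one entry in the master's in-memory region states. */
    static final class RegionState {
        final State state;
        final String server; // null when the region is not placed on any server
        RegionState(State state, String server) { this.state = state; this.server = server; }
        @Override public String toString() { return "{state=" + state + ", server=" + server + "}"; }
    }

    static final Map<String, RegionState> regionStates = new HashMap<>();

    /** Mirrors the failing path in step 8: an OPEN region may not be forced OFFLINE. */
    static synchronized void forceOffline(String region) {
        RegionState current = regionStates.get(region);
        if (current != null && current.state == State.OPEN) {
            throw new IllegalStateException(
                "Unexpected state : " + current + " .. Cannot transit it to OFFLINE.");
        }
        regionStates.put(region, new RegionState(State.OFFLINE, null));
    }

    /** Stands in for the SSH of a dead server moving a region elsewhere (steps 5 and 6). */
    static synchronized void reassignBySsh(String region, String newServer) {
        regionStates.put(region, new RegionState(State.OPEN, newServer));
    }

    /**
     * Hypothetical guarded retry: proceed only if the region is still PENDING_OPEN
     * on the server whose bulk assignment failed. Returns false when it skips.
     */
    static synchronized boolean guardedRetry(String region, String failedServer, String target) {
        RegionState current = regionStates.get(region);
        boolean stale = current == null
            || current.state != State.PENDING_OPEN
            || !failedServer.equals(current.server);
        if (stale) {
            return false; // another handler already dealt with this region
        }
        forceOffline(region);
        regionStates.put(region, new RegionState(State.PENDING_OPEN, target));
        return true;
    }

    public static void main(String[] args) {
        String region = "300d71b112325d43b99b6148ec7bc5b3";
        regionStates.put(region, new RegionState(State.PENDING_OPEN, "RS2")); // step 3
        reassignBySsh(region, "RS3");                                         // steps 4 to 6
        try {
            forceOffline(region);                                             // steps 7 and 8, unguarded
        } catch (IllegalStateException e) {
            System.out.println("unguarded retry failed: " + e.getMessage());
        }
        System.out.println("guarded retry ran: " + guardedRetry(region, "RS2", "RS4"));
    }
}
```

In this sketch the check and the state change happen under one lock, so the SSH and the retry cannot interleave between them. Whether the real master should skip, re-plan, or serialize the two handlers is left open here.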



--
This message was sent by Atlassian JIRA
(v6.1#6144)
