hbase-issues mailing list archives

From "Maryann Xue (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
Date Thu, 12 Jul 2012 09:05:35 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412618#comment-13412618 ]

Maryann Xue commented on HBASE-6299:
------------------------------------

bq. I'm not sure what synchronize does. I suppose it prevents double assign
The interesting thing is that we check whether the RegionState is OFFLINE or CLOSED before setting
OFFLINE in ZK, and abort if that check fails, yet we allow any RegionState before setting the
RegionState itself to OFFLINE. And since this synchronized block on the RegionState does not guard
the whole process (the state change from PENDING_OPEN to OPENED), double assignment is not prevented
at all; there is a check in setOfflineInZookeeper(), but it is only effective when hijack=true.
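To make the window concrete, here is a stripped-down toy model of the flow I am describing
(hypothetical names except where they match the methods above; this only illustrates the
interleaving, it is not the actual AssignmentManager source):
{code}
// Toy model of the assign path: the lock covers only the OFFLINE bookkeeping
// and the sendRegionOpen() RPC, so a second assign() on the same region can
// interleave between the RPC succeeding and regionOnline() removing the
// region from regionsInTransition.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DoubleAssignSketch {
  enum State { OFFLINE, PENDING_OPEN, OPENED }

  // stands in for AM.regionsInTransition
  static final Map<String, State> regionsInTransition =
      new ConcurrentHashMap<String, State>();
  // stands in for AM.regions (region -> serving RS)
  static final Map<String, String> regions =
      new ConcurrentHashMap<String, String>();

  static void assign(String region, Object state) {
    synchronized (state) {
      // no check of the current state here -- any RegionState is accepted
      regionsInTransition.put(region, State.OFFLINE);
      // setOfflineInZookeeper(): the "ZK node already exists" check only runs
      // when hijack=true, so a plain assign sails through
      regionsInTransition.put(region, State.PENDING_OPEN);
      // sendRegionOpen() RPC to the destination RS would go here
    }
    // <-- window: a second assign() can run to completion right here, before
    //     the OPENED handling below ever executes
  }

  static void regionOnline(String region, String serverName) {
    // OpenedRegionHandler path: this also wipes out whatever state the second
    // assignment had put into regionsInTransition
    regionsInTransition.remove(region);
    regions.put(region, serverName);
  }
}
{code}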
So far I've seen two error cases with double assignment:
1. HBASE-5816: The second assign comes in almost at the same time as the first assignment, but
blocks on synchronized(state). After the first assignment succeeds with sendRegionOpen() and exits
the synchronized block, the second assignment enters the block and calls setOfflineInZookeeper(),
which fails the "RegionState OFFLINE" check and leads to a master abort.
2. The second assignment kicks in after the first assignment has succeeded and deleted the ZK
node, but before regionOnline() is called (which removes the region from AM.regionsInTransition
and adds it to AM.regions). The second assignment starts a normal assign process: setting the
RegionState to OFFLINE, setting the ZK node to OFFLINE, and calling sendRegionOpen() to the same
destination RS. Then, when the first assignment calls AM.regionOnline(), this region gets removed
from AM.regionsInTransition. This is a double assignment to the RS. If the RS chooses to
cleanUpFailedOpen() as in 0.90, this region will be served nowhere and will not even exist in the
master's regionsInTransition; if the RS chooses to proceed with openRegion() as in trunk, the
master will get RS events "OPENING" and "OPENED" that correspond to NO RegionState, as in HBASE-6300.
I can see that we check whether the ZK node already exists in setOfflineInZookeeper() to prevent
double assignment, but this check is only effective when hijack=true.
Is it possible to do something at an earlier stage to prevent double assignment, for example in
forceRegionStateToOffline()?
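For instance, something along these lines (purely illustrative, sketched against the toy model
above, not a patch against the real forceRegionStateToOffline()):
{code}
// Hypothetical guard: refuse to start a new assignment while the region is
// still tracked as in transition or already marked online, instead of relying
// only on the hijack=true check in setOfflineInZookeeper().
static boolean forceRegionStateToOffline(String region) {
  if (regionsInTransition.containsKey(region) || regions.containsKey(region)) {
    // a prior assignment is still in flight (or has already completed);
    // bail out instead of issuing a second sendRegionOpen()
    return false;
  }
  regionsInTransition.put(region, State.OFFLINE);
  return true;
}
{code}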

bq. @Mary Is HBASE-5396 committed?
No... but explicitly calling assign() from HBaseAdmin can cause the same problem.
                
> RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency
in HMaster's region state and a series of successive problems.
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6299
>                 URL: https://issues.apache.org/jira/browse/HBASE-6299
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.6, 0.94.0
>            Reporter: Maryann Xue
>            Assignee: Maryann Xue
>            Priority: Critical
>         Attachments: HBASE-6299-v2.patch, HBASE-6299.patch
>
>
> 1. HMaster tries to assign a region to an RS.
> 2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
> 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open
region request and starts to proceed, eventually succeeding. However, due to network problems,
HMaster fails to receive the response to the openRegion() call, and the call times out.
> 4. HMaster attempts to assign for a second time, choosing another RS.
> 5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open
on the previous RS, and the RegionState has already been removed from regionsInTransition,
HMaster considers the unassigned ZK node "RS_ZK_REGION_OPENING" updated by the second attempt
invalid and ignores it.
> 6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot
be created.
> {code}
> 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using
pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078
> 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning
region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
to swbss-hadoop-006,60020,1340890678078
> 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling
transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:60000, region=b713fd655fa02395496c5a6e39ddf568
> 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling
transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568
> 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling
transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568
> 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling
transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568
> 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
Handling OPENED event for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575,
usedHeap=15282, maxHeap=31301); deleting unassigned node
> 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2377fee2ae80007
Deleting existing unassigned node for b713fd655fa02395496c5a6e39ddf568 that is in expected
state RS_ZK_REGION_OPENED
> 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2377fee2ae80007
Successfully deleted unassigned node for region b713fd655fa02395496c5a6e39ddf568 in expected
state RS_ZK_REGION_OPENED
> 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
The master has opened the region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
that was online on serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945,
regions=575, usedHeap=15282, maxHeap=31301)
> 2012-06-29 07:07:41,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed
assignment of CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=575, usedHeap=0,
maxHeap=0), trying to assign elsewhere instead; retry=0
> java.net.SocketTimeoutException: Call to /172.16.0.6:60020 failed on socket timeout exception:
java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready
for read. ch : java.nio.channels.SocketChannel[connected local=/172.16.0.2:53765 remote=/172.16.0.6:60020]
>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
>         at $Proxy8.openRegion(Unknown Source)
>         at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892)
>         at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel
to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.16.0.2:53765
remote=/172.16.0.6:60020]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:301)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         at java.io.DataInputStream.readInt(DataInputStream.java:370)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:541)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:479)
> 2012-06-29 07:07:41,142 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous
transition plan was found (or we are ignoring an existing plan) for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
so generated a random one; hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
src=, dest=swbss-hadoop-164,60020,1340888346294; 15 (online=15, exclude=serverName=swbss-hadoop-006,60020,1340890678078,
load=(requests=0, regions=575, usedHeap=0, maxHeap=0)) available servers
> 2012-06-29 07:07:41,142 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2377fee2ae80007
Creating (or updating) unassigned node for b713fd655fa02395496c5a6e39ddf568 with OFFLINE state
> 2012-06-29 07:07:41,145 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using
pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
src=, dest=swbss-hadoop-164,60020,1340888346294
> 2012-06-29 07:07:41,145 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning
region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
to swbss-hadoop-164,60020,1340888346294
> 2012-06-29 07:07:41,149 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling
transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-164,60020,1340888346294, region=b713fd655fa02395496c5a6e39ddf568
> 2012-06-29 07:07:41,150 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received
OPENING for region b713fd655fa02395496c5a6e39ddf568 from server swbss-hadoop-164,60020,1340888346294
but region was in  the state null and not in expected PENDING_OPEN or OPENING states
> 2012-06-29 07:07:41,296 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling
transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-164,60020,1340888346294, region=b713fd655fa02395496c5a6e39ddf568
> 2012-06-29 07:07:41,296 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received
OPENING for region b713fd655fa02395496c5a6e39ddf568 from server swbss-hadoop-164,60020,1340888346294
but region was in  the state null and not in expected PENDING_OPEN or OPENING states
> 2012-06-29 07:07:41,302 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling
transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-164,60020,1340888346294, region=b713fd655fa02395496c5a6e39ddf568
> 2012-06-29 07:07:41,302 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received
OPENED for region b713fd655fa02395496c5a6e39ddf568 from server swbss-hadoop-164,60020,1340888346294
but region was in  the state null and not in expected PENDING_OPEN or OPENING states
> 2012-06-29 07:08:38,872 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
src=swbss-hadoop-006,60020,1340890678078, dest=swbss-hadoop-008,60020,1340891085175
> 2012-06-29 07:08:38,872 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting
unassignment of region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
(offlining)
> 2012-06-29 07:08:47,875 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent
CLOSE to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=0, usedHeap=0,
maxHeap=0) for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
> ...
> 2012-06-29 08:04:37,681 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions
in transition timed out:  CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
state=PENDING_CLOSE, ts=1340926468331, server=null
> 2012-06-29 08:04:37,681 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region
has been PENDING_CLOSE for too long, running forced unassign again on region=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
> 2012-06-29 08:04:47,681 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions
in transition timed out:  CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
state=PENDING_CLOSE, ts=1340926468331, server=null
> 2012-06-29 08:04:47,682 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region
has been PENDING_CLOSE for too long, running forced unassign again on region=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
