hbase-issues mailing list archives

From "Jeffrey Zhong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9773) Master aborted when hbck asked the master to assign a region that was already online
Date Wed, 16 Oct 2013 22:12:43 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797344#comment-13797344 ]

Jeffrey Zhong commented on HBASE-9773:
--------------------------------------

I checked the fix and I think it opens the door to double assignment. The closeRegion
request is processed asynchronously: even after we send the close RPC to the region server
hosting the region, the region could open on another region server before the old region
server has actually closed it. We would then end up with a double assignment.
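
To make the race concrete, here is a minimal toy model of it. It is only a sketch: `ToyCluster` and all of its methods are hypothetical names invented for illustration, not HBase APIs, and the asynchronous close is modeled as a simple queue of unprocessed requests.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Toy model only; none of these names are HBase APIs. */
class ToyCluster {
    // server name -> regions that server currently has open
    private final Map<String, Set<String>> openRegions = new HashMap<>();
    // close requests that were sent but not yet processed: {server, region}
    private final Queue<String[]> pendingCloses = new ArrayDeque<>();

    void open(String server, String region) {
        openRegions.computeIfAbsent(server, s -> new HashSet<>()).add(region);
    }

    /** Asynchronous close: the request is only queued, the region stays open. */
    void sendClose(String server, String region) {
        pendingCloses.add(new String[] {server, region});
    }

    /** Simulates the region servers eventually processing every queued close. */
    void drainCloses() {
        String[] req;
        while ((req = pendingCloses.poll()) != null) {
            Set<String> regions = openRegions.get(req[0]);
            if (regions != null) {
                regions.remove(req[1]);
            }
        }
    }

    /** Number of servers that currently have the region open. */
    int hostCount(String region) {
        int n = 0;
        for (Set<String> regions : openRegions.values()) {
            if (regions.contains(region)) {
                n++;
            }
        }
        return n;
    }

    /** Reassigns right after sending the close, without waiting for it. */
    static int reassignWithoutWaiting() {
        ToyCluster c = new ToyCluster();
        c.open("rs1", "namespace");
        c.sendClose("rs1", "namespace");
        c.open("rs9", "namespace"); // the old server has not closed the region yet
        return c.hostCount("namespace");
    }

    /** Reassigns only after the old server has processed the close. */
    static int reassignAfterClose() {
        ToyCluster c = new ToyCluster();
        c.open("rs1", "namespace");
        c.sendClose("rs1", "namespace");
        c.drainCloses();
        c.open("rs9", "namespace");
        return c.hostCount("namespace");
    }
}
```

In this model `reassignWithoutWaiting()` leaves the region open on two servers, while `reassignAfterClose()` leaves it on one.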

In addition, we potentially have a data loss situation. AM#forceRegionStateToOffline doesn't
wait for the region to be fully closed. If the region is opened while the old RS is still
flushing, some store files may not be opened in the new location. Worse, if the old RS crashes,
WAL splitting will be skipped and we have permanent data loss.
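
A rough sketch of the kind of guard I have in mind follows. It is an assumption about the shape of a fix, not the actual patch: `CloseGate`, `regionClosed`, `awaitClosed` and `safeToAssign` are hypothetical names, and the rule in `safeToAssign` simply restates the two conditions above (close confirmed, or old server dead with its WAL already split).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; not part of HBase. */
class CloseGate {
    private final CountDownLatch closed = new CountDownLatch(1);

    /** Called when the old region server confirms the region is fully closed. */
    void regionClosed() {
        closed.countDown();
    }

    /** Waits up to timeoutMs for the close confirmation; true if it arrived. */
    boolean awaitClosed(long timeoutMs) {
        try {
            return closed.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return false;
        }
    }

    /**
     * Assumed rule for this sketch: the region may be assigned elsewhere only
     * if the close was confirmed, or the old server is dead and its WAL has
     * already been split.
     */
    static boolean safeToAssign(boolean closeConfirmed,
                                boolean oldServerDead,
                                boolean walSplitDone) {
        if (closeConfirmed) {
            return true;
        }
        return oldServerDead && walSplitDone;
    }
}
```

The caller would check `safeToAssign(gate.awaitClosed(timeoutMs), oldServerDead, walSplitDone)` before moving the region, so a region that is still flushing on a live server is never assigned elsewhere.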

[~jxiang] Could you please double-check the above? Meanwhile, let me try to come up with an
addendum patch. Thanks.

> Master aborted when hbck asked the master to assign a region that was already online
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-9773
>                 URL: https://issues.apache.org/jira/browse/HBASE-9773
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Jimmy Xiang
>             Fix For: 0.98.0, 0.96.1
>
>         Attachments: trunk-9773.patch, trunk-9773_v2.patch
>
>
> Came across this situation (with a version of 0.96 very close to RC5 version created
on 10/11):
> The sequence of events that happened:
> 1. The hbck tool couldn't communicate with the RegionServer hosting namespace region
due to some security exceptions. hbck INCORRECTLY assumed the region was not deployed.
> In output.log (client side):
> {noformat}
> 2013-10-12 10:42:57,067|beaver.machine|INFO|ERROR: Region { meta => hbase:namespace,,1381564449706.a0ac0825ba2d0830614e7f808f31787a.,
hdfs => hdfs://gs-hdp2-secure-1381559462-hbase-12.cs1cloud.internal:8020/apps/hbase/data/data/hbase/namespace/a0ac0825ba2d0830614e7f808f31787a,
deployed =>  } not deployed on any region server.
> 2013-10-12 10:42:57,067|beaver.machine|INFO|Trying to fix unassigned region...
> {noformat}
> 2. This led to the hbck tool trying to tell the master to "assign" the region.
> In master log (hbase-hbase-master-gs-hdp2-secure-1381559462-hbase-12.log):
> {noformat}
> 2013-10-12 10:52:35,960 INFO  [RpcServer.handler=4,port=60000] master.HMaster: Client=hbase//172.18.145.105
assign hbase:namespace,,1381564449706.a0ac0825ba2d0830614e7f808f31787a.
> {noformat}
> 3. The master went through the steps - sent a CLOSE to the RegionServer hosting namespace
region.
> From master log:
> {noformat}
> 2013-10-12 10:52:35,981 DEBUG [RpcServer.handler=4,port=60000] master.AssignmentManager:
Sent CLOSE to gs-hdp2-secure-1381559462-hbase-1.cs1cloud.internal,60020,1381564439794 for
region hbase:namespace,,1381564449706.a0ac0825ba2d0830614e7f808f31787a.
> {noformat}
> 4. The master then tried to assign the namespace region to a region server, and in the
process ABORTED:
> From master log:
> {noformat}
> 2013-10-12 10:52:36,025 DEBUG [RpcServer.handler=4,port=60000] master.AssignmentManager:
No previous transition plan found (or ignoring an existing plan) for hbase:namespace,,1381564449706.a0ac0825ba2d0830614e7f808f31787a.;
generated random plan=hri=hbase:namespace,,1381564449706.a0ac0825ba2d0830614e7f808f31787a.,
src=, dest=gs-hdp2-secure-1381559462-hbase-9.cs1cloud.internal,60020,1381564439807; 4 (online=4,
available=4) available servers, forceNewPlan=true
> 2013-10-12 10:52:36,026 FATAL [RpcServer.handler=4,port=60000] master.HMaster: Master
server abort: loaded coprocessors are: [org.apache.hadoop.hbase.security.access.AccessController]
> 2013-10-12 10:52:36,027 FATAL [RpcServer.handler=4,port=60000] master.HMaster: Unexpected
state : {a0ac0825ba2d0830614e7f808f31787a state=OPEN, ts=1381564451344, server=gs-hdp2-secure-1381559462-hbase-1.cs1cloud.internal,60020,1381564439794}
.. Cannot transit it to OFFLINE.
> java.lang.IllegalStateException: Unexpected state : {a0ac0825ba2d0830614e7f808f31787a
state=OPEN, ts=1381564451344, server=gs-hdp2-secure-1381559462-hbase-1.cs1cloud.internal,60020,1381564439794}
.. Cannot transit it to OFFLINE.
> {noformat}
> {code}AssignmentManager.assign(HRegionInfo region, boolean setOfflineInZK, boolean forceNewPlan){code}
is the method that does all the above. This was called from the HMaster with true for both
the boolean arguments.
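
For readers following the trace, the abort in step 4 can be pictured with a toy state check. This is illustrative only: `ToyRegionState` is a hypothetical class, and the set of states allowed to move to OFFLINE is an assumption made for the sketch, not the real AssignmentManager rule. The only part taken from the log is that a region still in OPEN could not be moved to OFFLINE.

```java
import java.util.EnumSet;
import java.util.Set;

/** Toy model only; not the real HBase region state machine. */
class ToyRegionState {
    enum State { OFFLINE, PENDING_OPEN, OPEN, PENDING_CLOSE, CLOSED }

    // Assumption for this sketch: only these states may move directly to OFFLINE.
    private static final Set<State> MAY_GO_OFFLINE =
        EnumSet.of(State.OFFLINE, State.CLOSED);

    /** Mirrors the failure in the log: OPEN cannot be forced to OFFLINE. */
    static State forceOffline(State current) {
        if (!MAY_GO_OFFLINE.contains(current)) {
            throw new IllegalStateException(
                "Unexpected state : " + current + " .. Cannot transit it to OFFLINE.");
        }
        return State.OFFLINE;
    }

    /** Rejects the request instead of letting the exception abort the caller. */
    static boolean tryAssign(State current) {
        try {
            forceOffline(current);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

Because the CLOSE in step 3 had only been sent and not yet completed, the region was still OPEN when the assign in step 4 tried to take it offline.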



--
This message was sent by Atlassian JIRA
(v6.1#6144)
