hbase-issues mailing list archives

From "Jimmy Xiang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassign the same region
Date Fri, 16 Nov 2012 18:38:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499003#comment-13499003 ]

Jimmy Xiang commented on HBASE-5816:
------------------------------------

@Stack, sure. In 0.94, assigning the same region concurrently can easily lead to this issue.
The assign method is synchronized on the region state. Before entering the synchronized assign
method, the region is moved to OFFLINE state and added to RIT (regions in transition) if it is
not already there. If the region is already in transition, its state is left unchanged (unless
hijack is set, which only the timeout monitor uses). Once inside the synchronized assign method,
AM tries to set the region offline in ZK. In this case, however, the region state is
PENDING_OPEN/OPENING rather than OFFLINE, so the master aborts.
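
To make the 0.94 sequence concrete, here is a minimal Java sketch of it. It is a simplified
model under stated assumptions, not the actual AssignmentManager code: the class
LegacyAssignSketch, the string region names, and the plain map standing in for RIT are all
hypothetical, and the thrown IllegalStateException stands in for the master abort.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical, heavily simplified model of the 0.94 flow described above.
// None of these names are the real HBase classes or signatures.
class LegacyAssignSketch {
    enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN }

    static final class RegionState {
        State state;
        RegionState(State state) { this.state = state; }
    }

    // Stand-in for the regions-in-transition (RIT) map.
    private final ConcurrentMap<String, RegionState> regionsInTransition =
        new ConcurrentHashMap<>();

    void assign(String region) {
        // Before the synchronized part: add the region to RIT as OFFLINE,
        // but leave the state untouched if it is already in transition.
        RegionState rs = regionsInTransition.computeIfAbsent(
            region, r -> new RegionState(State.OFFLINE));

        // The assign itself is synchronized on the region state.
        synchronized (rs) {
            setOfflineInZooKeeper(rs);
            rs.state = State.PENDING_OPEN;
        }
    }

    // Models the state check that fails for the second caller.
    private void setOfflineInZooKeeper(RegionState rs) {
        if (rs.state != State.OFFLINE) {
            // In this model the exception stands in for the master abort.
            throw new IllegalStateException(
                "Unexpected state trying to OFFLINE; state=" + rs.state);
        }
    }
}
```

Calling assign("r1") twice on the same instance makes the second call throw, which mirrors
the "Unexpected state trying to OFFLINE" abort in the quoted log.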

In trunk, it is different:

{noformat}
      RegionState state = forceRegionStateToOffline(region, forceNewPlan);
      if (state != null) {
        assign(state, setOfflineInZK, forceNewPlan);
      }
{noformat}

forceRegionStateToOffline returns null if the region is already in transition, so we don't
assign it again and don't run into the problem.
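
As a rough illustration of that guard, the sketch below models only the null check. It is
an assumption-laden simplification: TrunkGuardSketch, the assignCalls counter, and the
one-argument forceRegionStateToOffline are hypothetical and ignore forceNewPlan and ZK
entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simplified model of the trunk guard; not the real AssignmentManager.
class TrunkGuardSketch {
    enum State { OFFLINE, PENDING_OPEN }

    static final class RegionState {
        State state = State.OFFLINE;
    }

    // Stand-in for the regions-in-transition map.
    private final Map<String, RegionState> regionsInTransition = new HashMap<>();

    // Counts how many times the inner assignment actually ran.
    int assignCalls = 0;

    // Returns a fresh OFFLINE state, or null if the region is already in transition.
    private RegionState forceRegionStateToOffline(String region) {
        if (regionsInTransition.containsKey(region)) {
            return null;
        }
        RegionState state = new RegionState();
        regionsInTransition.put(region, state);
        return state;
    }

    // Synchronized on the whole object here for brevity.
    synchronized void assign(String region) {
        RegionState state = forceRegionStateToOffline(region);
        if (state != null) {
            assignCalls++;
            state.state = State.PENDING_OPEN;
        }
    }
}
```

With this shape, a second assign for a region that is still in transition is a no-op
instead of a failed state check.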

As for assigning the region to a dead server: during the assign attempts, a new plan will be
used.
As for assignment by SSH with forceNewPlan = true: forceRegionStateToOffline will abort the
previous assignment if it is still in progress, or close the region if it is already assigned.
All these assignment calls are synchronized on the region.
Region state changes made by the ZK event thread are also synchronized on the region.

That's why I think we are good with the trunk branch. 
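
The per-region synchronization argument can be sketched as follows. This is a minimal
model, assuming one lock object per region name; PerRegionLockSketch, lockFor,
onRegionOpened, raceAssign, and the two counters are hypothetical names, and "aborting the
previous assignment" is reduced to bumping a counter.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of per-region locking; not the real AssignmentManager.
class PerRegionLockSketch {
    enum State { PENDING_OPEN }

    private final ConcurrentMap<String, Object> locks = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, State> inTransition = new ConcurrentHashMap<>();
    final AtomicInteger assignments = new AtomicInteger();
    final AtomicInteger aborted = new AtomicInteger();

    private Object lockFor(String region) {
        return locks.computeIfAbsent(region, r -> new Object());
    }

    // Every assignment path takes the same per-region lock.
    void assign(String region, boolean forceNewPlan) {
        synchronized (lockFor(region)) {
            if (inTransition.containsKey(region)) {
                if (!forceNewPlan) {
                    return; // already in transition: do not assign again
                }
                aborted.incrementAndGet(); // models aborting the previous attempt
            }
            inTransition.put(region, State.PENDING_OPEN);
            assignments.incrementAndGet();
        }
    }

    // Models a state change from the ZK event thread, under the same lock.
    void onRegionOpened(String region) {
        synchronized (lockFor(region)) {
            inTransition.remove(region);
        }
    }

    // Races several threads on one region and returns how many assignments ran.
    static int raceAssign(int threads) {
        PerRegionLockSketch am = new PerRegionLockSketch();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                am.assign("r1", false);
            });
            workers[i].start();
        }
        start.countDown();
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return am.assignments.get();
    }
}
```

In this model, racing callers without forceNewPlan produce a single assignment, and a
later call with forceNewPlan = true replaces the in-flight attempt rather than colliding
with it.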
                
> Balancer and ServerShutdownHandler concurrently reassign the same region
> ------------------------------------------------------------------------
>
>                 Key: HBASE-5816
>                 URL: https://issues.apache.org/jira/browse/HBASE-5816
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.6
>            Reporter: Maryann Xue
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>         Attachments: HBASE-5816.patch
>
>
> The first assign thread exits successfully after updating the RegionState to PENDING_OPEN,
> while the second assign follows immediately into "assign" and fails the RegionState check
> in setOfflineInZooKeeper(). This causes the master to abort.
> In the case below, the two concurrent assigns occurred when AM tried to assign a region
> to a dying/dead RS while the ServerShutdownHandler concurrently tried to assign the same
> region (from the region plan).
> {code}
> 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
src=hadoop05.sh.intel.com,60020,1334544902186, dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:44:57,648 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting
unassignment of region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
(offlining)
> 2012-04-17 05:44:57,648 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent
CLOSE to serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, regions=0,
usedHeap=0, maxHeap=0) for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> 2012-04-17 05:44:57,666 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling
new unassigned node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
> 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing
OFFLINE; was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. state=CLOSED,
ts=1334612697672, server=hadoop05.sh.intel.com,60020,1334544902186
> 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x236b912e9b3000e
Creating (or updating) unassigned node for fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
> 2012-04-17 05:52:59,096 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using
pre-existing plan for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.;
plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=hadoop05.sh.intel.com,60020,1334544902186,
dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:52:59,096 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning
region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:19,159 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing
OFFLINE; was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. state=PENDING_OPEN,
ts=1334613179096, server=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:59,033 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed
assignment of TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253,
load=(requests=0, regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
> java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket timeout
exception: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel
to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302
remote=/10.239.47.87:60020]
>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
>         at $Proxy7.openRegion(Unknown Source)
>         at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892)
>         at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel
to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302
remote=/10.239.47.87:60020]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:301)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         at java.io.DataInputStream.readInt(DataInputStream.java:370)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:541)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:479)
> 2012-04-17 05:54:59,035 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous
transition plan was found (or we are ignoring an existing plan) for TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
so generated a random one; hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
src=, dest=hadoop06.sh.intel.com,60020,1334544901894; 7 (online=7, exclude=serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253,
load=(requests=0, regions=0, usedHeap=0, maxHeap=0)) available servers
> 2012-04-17 05:54:59,035 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x236b912e9b3000e
Creating (or updating) unassigned node for fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
> 2012-04-17 05:54:59,045 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using
pre-existing plan for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.;
plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=, dest=hadoop06.sh.intel.com,60020,1334544901894
> 2012-04-17 05:54:59,045 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning
region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to hadoop06.sh.intel.com,60020,1334544901894
> 2012-04-17 05:54:59,046 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state
trying to OFFLINE; TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. state=PENDING_OPEN,
ts=1334613299045, server=hadoop06.sh.intel.com,60020,1334544901894
> java.lang.IllegalStateException
>         at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1167)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1107)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892)
>         at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:259)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> 2012-04-17 05:54:59,047 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
