hbase-issues mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13891) AM should handle RegionServerStoppedException during assignment
Date Thu, 11 Jun 2015 23:56:02 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582706#comment-14582706 ]

Enis Soztutar commented on HBASE-13891:
---------------------------------------

bq. Probably the RegionServerStoppedException should be detected and the destination of the
plan be added to the dead server list.
Catching this makes sense, but it is not clear how to handle it. We should not have more
than one source of cluster membership (see HBASE-13605). If, for example, we catch this and
run SSH (ServerShutdownHandler), we would be using both ZK and RPC failures to detect cluster
membership.

If we can find a way to change the target for the region assignment, that may prevent this
type of assignment loop. 
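
To illustrate the idea of changing the target, here is a minimal sketch, not HBase code. All names in it (AssignmentRetrySketch, assign, the openRegion predicate) are hypothetical, and a false return from openRegion stands in for an RPC failure such as RegionServerStoppedException. The failed server is excluded only for the current assignment and is never marked dead, so cluster membership would still come from a single source.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names are HBase APIs.
final class AssignmentRetrySketch {

    /** Outcome of one assignment: the accepting server (or null) and the servers tried, in order. */
    static final class Result {
        final String assignedTo;
        final List<String> attempts;

        Result(String assignedTo, List<String> attempts) {
            this.assignedTo = assignedTo;
            this.attempts = attempts;
        }
    }

    /**
     * Tries to open a region on one of the candidate servers.
     * openRegion returns false to model a failed open RPC (assumption for this sketch).
     */
    static Result assign(List<String> candidates, Predicate<String> openRegion, int maxAttempts) {
        Set<String> excluded = new HashSet<>();   // scoped to this assignment only
        List<String> attempts = new ArrayList<>();

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Pick the first candidate that has not already failed for this region.
            String target = null;
            for (String server : candidates) {
                if (!excluded.contains(server)) {
                    target = server;
                    break;
                }
            }
            if (target == null) {
                break; // every candidate has failed; give up instead of looping
            }

            attempts.add(target);
            if (openRegion.test(target)) {
                return new Result(target, attempts);
            }

            // Do not declare the server dead here; just avoid it for the next try.
            excluded.add(target);
        }
        return new Result(null, attempts);
    }
}
```

With candidates rs1, rs2, rs3 and rs1 refusing the open, the second attempt goes to rs2 rather than back to rs1, which is the loop shown in the log below.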

> AM should handle RegionServerStoppedException during assignment
> ---------------------------------------------------------------
>
>                 Key: HBASE-13891
>                 URL: https://issues.apache.org/jira/browse/HBASE-13891
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>    Affects Versions: 1.1.0.1
>            Reporter: Nick Dimiduk
>
> I noticed the following in the master logs
> {noformat}
> 2015-06-11 11:04:55,278 WARN  [AM.ZK.Worker-pool2-t337] master.AssignmentManager: Failed assignment of SYSTEM.SEQUENCE,\x8E\x00\x00\x00,1434010321127.d2be67cf43d6bd600c7f461701ca908f. to ip-172-31-32-232.ec2.internal,16020,1434020633773, trying to assign elsewhere instead; try=1 of 10
> org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server ip-172-31-32-232.ec2.internal,16020,1434020633773 not running, aborting
> 	at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:980)
> 	at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1382)
> 	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22117)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2112)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> 	at java.lang.Thread.run(Thread.java:745)
> 	at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
> 	at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:322)
> 	at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:752)
> 	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2136)
> 	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1590)
> 	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1568)
> 	at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:106)
> 	at org.apache.hadoop.hbase.master.AssignmentManager.handleRegion(AssignmentManager.java:1063)
> 	at org.apache.hadoop.hbase.master.AssignmentManager$6.run(AssignmentManager.java:1511)
> 	at org.apache.hadoop.hbase.master.AssignmentManager$3.run(AssignmentManager.java:1295)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.regionserver.RegionServerStoppedException): org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server ip-172-31-32-232.ec2.internal,16020,1434020633773 not running, aborting
> 	at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:980)
> 	at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1382)
> 	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22117)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2112)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> 	at java.lang.Thread.run(Thread.java:745)
> 	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1206)
> 	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
> 	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
> 	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.openRegion(AdminProtos.java:23003)
> 	at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:749)
> 	... 12 more
> ...
> 2015-06-11 11:04:55,289 INFO  [AM.ZK.Worker-pool2-t337] master.AssignmentManager: Assigning SYSTEM.SEQUENCE,\x8E\x00\x00\x00,1434010321127.d2be67cf43d6bd600c7f461701ca908f. to ip-172-31-32-232.ec2.internal,16020,1434020633773
> ...
> 2015-06-11 11:04:55,317 WARN  [AM.ZK.Worker-pool2-t337] master.AssignmentManager: Failed assignment of SYSTEM.SEQUENCE,\x8E\x00\x00\x00,1434010321127.d2be67cf43d6bd600c7f461701ca908f. to ip-172-31-32-232.ec2.internal,16020,1434020633773, trying to assign elsewhere instead; try=2 of 10
> <same long stack redacted>
> ...
> 2015-06-11 11:04:55,332 INFO  [AM.ZK.Worker-pool2-t337] master.AssignmentManager: Assigning SYSTEM.SEQUENCE,\x8E\x00\x00\x00,1434010321127.d2be67cf43d6bd600c7f461701ca908f. to ip-172-31-32-232.ec2.internal,16020,1434020633773
> {noformat}
> This is repeated over and over as the AM spams the same region to the same server. Probably
> the {{RegionServerStoppedException}} should be detected and the destination of the plan be
> added to the dead server list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
