Return-Path:
X-Original-To: apmail-hbase-issues-archive@www.apache.org
Delivered-To: apmail-hbase-issues-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
    by minotaur.apache.org (Postfix) with SMTP id 0A92717962
    for ; Fri, 12 Jun 2015 01:17:01 +0000 (UTC)
Received: (qmail 40198 invoked by uid 500); 12 Jun 2015 01:17:00 -0000
Delivered-To: apmail-hbase-issues-archive@hbase.apache.org
Received: (qmail 40143 invoked by uid 500); 12 Jun 2015 01:17:00 -0000
Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Delivered-To: mailing list issues@hbase.apache.org
Received: (qmail 40131 invoked by uid 99); 12 Jun 2015 01:17:00 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28)
    by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 12 Jun 2015 01:17:00 +0000
Date: Fri, 12 Jun 2015 01:17:00 +0000 (UTC)
From: "Nick Dimiduk (JIRA)"
To: issues@hbase.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Updated] (HBASE-13891) AM should handle RegionServerStoppedException during assignment
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/HBASE-13891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Dimiduk updated HBASE-13891:
---------------------------------
    Attachment: 13891.patch

Something like this?

> AM should handle RegionServerStoppedException during assignment
> ---------------------------------------------------------------
>
>                 Key: HBASE-13891
>                 URL: https://issues.apache.org/jira/browse/HBASE-13891
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>    Affects Versions: 1.1.0.1
>            Reporter: Nick Dimiduk
>         Attachments: 13891.patch
>
>
> I noticed the following in the master logs
> {noformat}
> 2015-06-11 11:04:55,278 WARN  [AM.ZK.Worker-pool2-t337] master.AssignmentManager: Failed assignment of SYSTEM.SEQUENCE,\x8E\x00\x00\x00,1434010321127.d2be67cf43d6bd600c7f461701ca908f. to ip-172-31-32-232.ec2.internal,16020,1434020633773, trying to assign elsewhere instead; try=1 of 10
> org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server ip-172-31-32-232.ec2.internal,16020,1434020633773 not running, aborting
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:980)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1382)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22117)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2112)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>     at java.lang.Thread.run(Thread.java:745)
>     at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:322)
>     at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:752)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2136)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1590)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1568)
>     at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:106)
>     at org.apache.hadoop.hbase.master.AssignmentManager.handleRegion(AssignmentManager.java:1063)
>     at org.apache.hadoop.hbase.master.AssignmentManager$6.run(AssignmentManager.java:1511)
>     at org.apache.hadoop.hbase.master.AssignmentManager$3.run(AssignmentManager.java:1295)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.regionserver.RegionServerStoppedException): org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server ip-172-31-32-232.ec2.internal,16020,1434020633773 not running, aborting
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:980)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1382)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22117)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2112)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>     at java.lang.Thread.run(Thread.java:745)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1206)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.openRegion(AdminProtos.java:23003)
>     at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:749)
>     ... 12 more
> ...
> 2015-06-11 11:04:55,289 INFO  [AM.ZK.Worker-pool2-t337] master.AssignmentManager: Assigning SYSTEM.SEQUENCE,\x8E\x00\x00\x00,1434010321127.d2be67cf43d6bd600c7f461701ca908f. to ip-172-31-32-232.ec2.internal,16020,1434020633773
> ...
> 2015-06-11 11:04:55,317 WARN  [AM.ZK.Worker-pool2-t337] master.AssignmentManager: Failed assignment of SYSTEM.SEQUENCE,\x8E\x00\x00\x00,1434010321127.d2be67cf43d6bd600c7f461701ca908f. to ip-172-31-32-232.ec2.internal,16020,1434020633773, trying to assign elsewhere instead; try=2 of 10
>
> ...
> 2015-06-11 11:04:55,332 INFO  [AM.ZK.Worker-pool2-t337] master.AssignmentManager: Assigning SYSTEM.SEQUENCE,\x8E\x00\x00\x00,1434010321127.d2be67cf43d6bd600c7f461701ca908f. to ip-172-31-32-232.ec2.internal,16020,1434020633773
> {noformat}
> This is repeated over and over as the AM spams the same region to the same server. Probably the {{RegionServerStoppedException}} should be detected and the destination of the plan be added to the dead server list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
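
For illustration only, the handling proposed in the issue description (detect the stopped-server error and stop offering that destination on later attempts) might look roughly like the sketch below. This is not the attached 13891.patch and not HBase's AssignmentManager code. Every name in it (AssignRetrySketch, ServerStoppedException, RegionOpener, assign, the excluded set) is a hypothetical stand-in invented for the example, and the "first non-excluded server" choice is a placeholder for whatever the real assignment plan would pick.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AssignRetrySketch {

    /** Hypothetical stand-in for HBase's RegionServerStoppedException. */
    static class ServerStoppedException extends Exception {
        ServerStoppedException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for the RPC that asks a server to open a region. */
    interface RegionOpener {
        void open(String region, String server) throws ServerStoppedException;
    }

    /**
     * Tries to assign a region, never retrying a server that has reported
     * itself stopped. Returns the server that accepted the region, or null
     * if the attempts ran out or no candidate server is left.
     */
    static String assign(String region, List<String> servers, RegionOpener opener,
                         int maxAttempts, Set<String> excluded) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Placeholder plan: pick the first server not yet excluded.
            String target = null;
            for (String server : servers) {
                if (!excluded.contains(server)) {
                    target = server;
                    break;
                }
            }
            if (target == null) {
                return null; // every candidate has been excluded
            }
            try {
                opener.open(region, target);
                return target;
            } catch (ServerStoppedException e) {
                // The key step: remember the stopped server so that the next
                // attempt goes elsewhere instead of repeating the same RPC.
                excluded.add(target);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> servers = Arrays.asList("rs-a", "rs-b");
        Set<String> excluded = new HashSet<>();
        // Simulated cluster: rs-a is shutting down, rs-b is healthy.
        RegionOpener opener = (region, server) -> {
            if (server.equals("rs-a")) {
                throw new ServerStoppedException("Server " + server + " not running, aborting");
            }
        };
        String chosen = assign("SYSTEM.SEQUENCE", servers, opener, 10, excluded);
        System.out.println("assigned to " + chosen + ", excluded " + excluded);
    }
}
```

In the log above the same destination is chosen on every try; in this sketch the second attempt goes to a different server because the first destination was recorded as stopped.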