Date: Wed, 28 Oct 2015 15:44:27 +0000 (UTC)
From: "Samir Ahmic (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-14664) Master failover issue: Backup master is unable to start if active master is killed and started in short time interval

     [ https://issues.apache.org/jira/browse/HBASE-14664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Samir Ahmic updated HBASE-14664:
--------------------------------
    Attachment: HBASE-14664.patch

Here is a patch for this issue. The logic is as follows: if we detect that there is no active master, which implies that no regionserver is hosting the hbase:meta table, remove the '/hbase/meta-region-server' znode from ZooKeeper.
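The logic described above can be sketched roughly as follows. This is not the code from HBASE-14664.patch; it is a minimal, self-contained illustration that uses an in-memory map in place of ZooKeeper. The class name StaleMetaLocationCleaner, the method clearMetaLocationIfNoActiveMaster, and the use of '/hbase/master' as the active-master znode path are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; a Map stands in for the ZooKeeper tree.
class StaleMetaLocationCleaner {
    // Assumed znode paths (parent znode taken to be /hbase).
    static final String ACTIVE_MASTER_ZNODE = "/hbase/master";
    static final String META_LOCATION_ZNODE = "/hbase/meta-region-server";

    /**
     * If no active master znode exists, drop the meta location znode so a
     * newly elected master does not try to reach a stale address.
     * Returns true only when a meta location entry was actually removed.
     */
    static boolean clearMetaLocationIfNoActiveMaster(Map<String, String> znodes) {
        if (znodes.containsKey(ACTIVE_MASTER_ZNODE)) {
            // An active master is registered; leave the meta location alone.
            return false;
        }
        return znodes.remove(META_LOCATION_ZNODE) != null;
    }

    public static void main(String[] args) {
        Map<String, String> znodes = new HashMap<>();
        // Stale entry pointing at the killed master, as seen in the report.
        znodes.put(META_LOCATION_ZNODE, "hnode1,16000,1445447051681");

        boolean removed = clearMetaLocationIfNoActiveMaster(znodes);
        System.out.println("removed=" + removed + ", remaining=" + znodes.keySet());
    }
}
```

In a real cluster the check and the delete would go through the ZooKeeper client and would need to handle races with a master that registers between the two steps; the sketch ignores that.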

> Master failover issue: Backup master is unable to start if active master is killed and started in short time interval
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14664
>                 URL: https://issues.apache.org/jira/browse/HBASE-14664
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.0.0
>            Reporter: Samir Ahmic
>            Assignee: Samir Ahmic
>             Fix For: 2.0.0
>
>         Attachments: HBASE-14664.patch
>
>
> I noticed this issue while running IntegrationTestDDLMasterFailover. It can be reproduced simply by executing the following on the active master (tested on a cluster with two masters and three regionservers):
> {code}
> $ kill -9 master_pid; hbase-daemon.sh start master
> {code}
> The logs show that the new active master is trying to locate the hbase:meta table on the restarted active master:
> {code}
> 2015-10-21 19:28:20,804 INFO  [hnode2:16000.activeMasterManager] zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=hnode1,16000,1445447051681, exception=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1092)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1330)
>         at org.apache.hadoop.hbase.master.MasterRpcServices.getRegionInfo(MasterRpcServices.java:1525)
>         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22233)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2136)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:106)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
> 2015-10-21 19:28:20,805 INFO  [hnode2:16000.activeMasterManager] master.HMaster: Meta was in transition on hnode1,16000,1445447051681
> 2015-10-21 19:28:20,805 INFO  [hnode2:16000.activeMasterManager] master.AssignmentManager: Processing {1588230740 state=OPEN, ts=1445448500598, server=hnode1,16000,1445447051681
> {code}
> Because of the above, the master is unable to read the hbase:meta table:
> {code}
> 2015-10-21 19:28:49,429 INFO  [hconnection-0x6e9cebcc-shared--pool6-t1] client.AsyncProcess: #2, table=hbase:meta, attempt=10/351 failed=1ops, last exception: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1092)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2083)
>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32462)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2136)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:106)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> which prevents the master from completing startup.
> I have also noticed that in this case the value of the /hbase/meta-region-server znode always points to the restarted active master (hnode1 in my cluster).
> I was able to work around this issue by repeating the same scenario with the following:
> {code}
> $ kill -9 master_pid; hbase zkcli rmr /hbase/meta-region-server; hbase-daemon.sh start master
> {code}
> So the issue is probably caused by a stale value in the /hbase/meta-region-server znode. I will try to create a patch based on the above.
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)