Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Mon, 9 Nov 2015 14:25:11 +0000 (UTC)
From: "Pankaj Kumar (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12896695.1443409813000.15830.1447079111108@Atlassian.JIRA>
In-Reply-To: <JIRA.12896695.1443409813000@Atlassian.JIRA>
References: <JIRA.12896695.1443409813000@Atlassian.JIRA>
 <JIRA.12896695.1443409813884@arcas>
Subject: [jira] [Commented] (HBASE-14498) Master stuck in infinite loop when
 all Zookeeper servers are unreachable.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996585#comment-14996585 ] 

Pankaj Kumar commented on HBASE-14498:
--------------------------------------

Added a test case and included the wait duration in the abort message in the V2 patch.
Please review, thanks.

> Master stuck in infinite loop when all Zookeeper servers are unreachable.
> -------------------------------------------------------------------------
>
>                 Key: HBASE-14498
>                 URL: https://issues.apache.org/jira/browse/HBASE-14498
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>            Reporter: Y. SREENIVASULU REDDY
>            Assignee: Pankaj Kumar
>            Priority: Blocker
>         Attachments: HBASE-14498-V2.patch, HBASE-14498.patch
>
>
> We hit an unusual scenario in our production environment.
> In an HA cluster:
> > The active master (HM1) was unable to connect to any ZooKeeper server because the network between the master machine and the ZooKeeper servers broke down.
> {code}
> 2015-09-26 15:24:47,508 INFO [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 33463ms for sessionid 0x104576b8dda0002, closing socket connection and attempting reconnect
> 2015-09-26 15:24:47,877 INFO [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] client.FourLetterWordMain: connecting to ZK-Host1 2181
> 2015-09-26 15:24:48,236 INFO [main-SendThread(ZK-Host1:2181)] client.FourLetterWordMain: connecting to ZK-Host1 2181
> 2015-09-26 15:24:49,879 WARN [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
> 2015-09-26 15:24:49,879 INFO [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Opening socket connection to server ZK-Host1/ZK-IP1:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-09-26 15:24:50,238 WARN [main-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
> 2015-09-26 15:24:50,238 INFO [main-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Opening socket connection to server ZK-Host1/ZK-Host1:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-09-26 15:25:17,470 INFO [main-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 30023ms for sessionid 0x2045762cc710006, closing socket connection and attempting reconnect
> 2015-09-26 15:25:17,571 WARN [master/HM1-Host/HM1-IP:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ZK-Host:2181,ZK-Host1:2181,ZK-Host2:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
> 2015-09-26 15:25:17,872 INFO [main-SendThread(ZK-Host:2181)] client.FourLetterWordMain: connecting to ZK-Host 2181
> 2015-09-26 15:25:19,874 WARN [main-SendThread(ZK-Host:2181)] zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host
> 2015-09-26 15:25:19,874 INFO [main-SendThread(ZK-Host:2181)] zookeeper.ClientCnxn: Opening socket connection to server ZK-Host/ZK-IP:2181. Will not attempt to authenticate using SASL (unknown error)
> {code}
> > Because HM1 could not connect to any ZooKeeper server, it never received a session expiry notification from the ZooKeeper side, so HM1 did not abort.
> > When HM1's ZooKeeper session timed out, the standby master (HM2) registered itself as the active master.
> > HM2 keeps waiting for region servers to report to it as part of active master initialization.
> {noformat} 
> 2015-09-26 15:24:44,928 | INFO | HM2-Host:21300.activeMasterManager | Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms. | org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
> ---
> ---
> 2015-09-26 15:32:50,841 | INFO | HM2-Host:21300.activeMasterManager | Waiting for region servers count to settle; currently checked in 0, slept for 483913 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms. | org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
> {noformat}
> > Meanwhile, the region servers keep reporting to HM1 at a 3-second interval. A region server retrieves the master location from ZooKeeper only when it cannot connect to the master (ServiceException).
> By the current design, region servers will not report to HM2 unless HM1 aborts, so HM2 will exit (InitializationMonitor) and wait for region servers again, in a loop.
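
For illustration only (this is not the attached patch), the kind of bounded wait discussed in this issue could be sketched as below. The class name `ZkDisconnectWatchdog`, its methods, and the `maxWaitMs` limit are hypothetical and are not HBase or ZooKeeper APIs. The sketch assumes the master is told when its ZooKeeper connection is lost or restored, and that it aborts with the observed wait duration once the limit is exceeded.

```java
/**
 * Hypothetical helper, not part of HBase: tracks how long the ZooKeeper
 * connection has been lost and reports when the master should abort.
 */
final class ZkDisconnectWatchdog {
    private final long maxWaitMs;                            // assumed upper bound on time spent disconnected
    private final java.util.function.LongSupplier clockMs;   // injectable clock, in milliseconds
    private long disconnectedSinceMs = -1L;                  // -1 means "currently connected"

    ZkDisconnectWatchdog(long maxWaitMs, java.util.function.LongSupplier clockMs) {
        this.maxWaitMs = maxWaitMs;
        this.clockMs = clockMs;
    }

    /** Call when the connection is lost; repeated calls keep the first timestamp. */
    synchronized void onDisconnected() {
        if (disconnectedSinceMs < 0) {
            disconnectedSinceMs = clockMs.getAsLong();
        }
    }

    /** Call when the connection is re-established; resets the timer. */
    synchronized void onConnected() {
        disconnectedSinceMs = -1L;
    }

    /**
     * Returns an abort message that includes the wait duration once the
     * limit has been reached, or null while the master should keep running.
     */
    synchronized String checkAbort() {
        if (disconnectedSinceMs < 0) {
            return null;
        }
        long waitedMs = clockMs.getAsLong() - disconnectedSinceMs;
        if (waitedMs < maxWaitMs) {
            return null;
        }
        return "Aborting: no ZooKeeper connection for " + waitedMs
            + " ms (limit " + maxWaitMs + " ms)";
    }
}
```

In the scenario above, an active master using such a check would stop serving once the limit passed, region servers would then fail to reach HM1, look up the master location in ZooKeeper again, and report to HM2.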


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)