Date: Sat, 14 Nov 2015 10:51:11 +0000 (UTC)
From: "Pankaj Kumar (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-14498) Master stuck in infinite loop when all Zookeeper servers are unreachable

    [ https://issues.apache.org/jira/browse/HBASE-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005327#comment-15005327 ]

Pankaj Kumar commented on HBASE-14498:
--------------------------------------

How does the test replicate what the original description describes? It is a tricky scenario. Thanks for reporting it. I am afraid that we may not have actually fixed the scenario described.
>> As per the issue,
- ZKs were not reachable from the HM.
- The HM received DISCONNECT events continuously.
- On DISCONNECT we just ignored the event and kept retrying regardless of zookeeper.session.timeout. That is why the master was not aborted even after zookeeper.session.timeout had elapsed.
I tried to simulate the same scenario in the test case.

isConnected is the name of a method you would invoke to check a boolean named connected. It is not what you should name a variable.
>> I will modify the variable name.

Is this right?
connWaitTimeOut = this.conf.getLong("zookeeper.session.timeout", 90000) * 2 / 3;
IIRC, you ask zk for a session timeout and it may give you something other than what you asked for (it is a while since I dug in here)
>> The idea is that the time interval (t) should be less than the ZK session timeout (maybe 2/3 of the session timeout value). This is to make sure that the standby HM will not become active within this time period.

You drop the prefix here:
LOG.debug("Received Disconnected from ZooKeeper.");
prefix helps debugging... otherwise these zk logs are hard to trace to their origin.
>> My bad, will revert this.

Every call into a disconnect is going to spawn a new one of these unnamed threads?
>> Yes, a daemon thread will be spawned and will stay active until connWaitTimeOut elapses or SyncConnected is received.

Did you see the below message in your log output?
LOG.debug(prefix("Received Disconnected from ZooKeeper, ignoring"));
>> Yes, it was written.

The idea is that we could disconnect but we'll keep trying to reconnect for zk session timeout and may succeed? Has the zk session timeout expired when we get this disconnect message? Should we abort as soon as we get one of these (I wonder why we have the comment that says abort when we get such a message but we don't actually? Because the abort is done elsewhere?)
>> In this scenario, the ZK session will not expire for the HM (zk-client) because, as far as I know, session expiry is initiated by the ZK server (please correct me if I'm wrong); the zk-client doesn't handle this.
So the HM will receive DISCONNECT events and keep trying forever until it connects to ZK.

Thanks.

> Master stuck in infinite loop when all Zookeeper servers are unreachable
> ------------------------------------------------------------------------
>
> Key: HBASE-14498
> URL: https://issues.apache.org/jira/browse/HBASE-14498
> Project: HBase
> Issue Type: Bug
> Components: master
> Reporter: Y. SREENIVASULU REDDY
> Assignee: Pankaj Kumar
> Priority: Blocker
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.4
>
> Attachments: HBASE-14498-V2.patch, HBASE-14498-V3.patch, HBASE-14498-V4.patch, HBASE-14498.patch
>
>
> We met a weird scenario in our production environment.
> In a HA cluster,
> > Active Master (HM1) is not able to connect to any Zookeeper server (due to a network breakdown between the master machine and the Zookeeper servers).
> {code}
> 2015-09-26 15:24:47,508 INFO [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 33463ms for sessionid 0x104576b8dda0002, closing socket connection and attempting reconnect
> 2015-09-26 15:24:47,877 INFO [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] client.FourLetterWordMain: connecting to ZK-Host1 2181
> 2015-09-26 15:24:48,236 INFO [main-SendThread(ZK-Host1:2181)] client.FourLetterWordMain: connecting to ZK-Host1 2181
> 2015-09-26 15:24:49,879 WARN [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
> 2015-09-26 15:24:49,879 INFO [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Opening socket connection to server ZK-Host1/ZK-IP1:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-09-26 15:24:50,238 WARN [main-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
> 2015-09-26 15:24:50,238 INFO [main-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Opening socket connection to server ZK-Host1/ZK-Host1:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-09-26 15:25:17,470 INFO [main-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 30023ms for sessionid 0x2045762cc710006, closing socket connection and attempting reconnect
> 2015-09-26 15:25:17,571 WARN [master/HM1-Host/HM1-IP:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ZK-Host:2181,ZK-Host1:2181,ZK-Host2:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
> 2015-09-26 15:25:17,872 INFO [main-SendThread(ZK-Host:2181)] client.FourLetterWordMain: connecting to ZK-Host 2181
> 2015-09-26 15:25:19,874 WARN [main-SendThread(ZK-Host:2181)] zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host
> 2015-09-26 15:25:19,874 INFO [main-SendThread(ZK-Host:2181)] zookeeper.ClientCnxn: Opening socket connection to server ZK-Host/ZK-IP:2181. Will not attempt to authenticate using SASL (unknown error)
> {code}
> > Since HM1 was not able to connect to any ZK, the session timeout didn't happen at the Zookeeper server side and HM1 didn't abort.
> > On Zookeeper session timeout, the standby master (HM2) registered itself as the active master.
> > HM2 keeps waiting for region servers to report to it as part of active master initialization.
> {noformat}
> 2015-09-26 15:24:44,928 | INFO | HM2-Host:21300.activeMasterManager | Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms. | org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
> ---
> ---
> 2015-09-26 15:32:50,841 | INFO | HM2-Host:21300.activeMasterManager | Waiting for region servers count to settle; currently checked in 0, slept for 483913 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms. | org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
> {noformat}
> > At the other end, region servers are reporting to HM1 at a 3 sec interval. Region servers retrieve the master location from Zookeeper only when they cannot connect to the Master (ServiceException).
> As per the current design, region servers will not report to HM2 unless HM1 aborts, so HM2 will exit (InitializationMonitor) and again wait for region servers in a loop.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)