Return-Path: X-Original-To: apmail-hadoop-common-issues-archive@minotaur.apache.org Delivered-To: apmail-hadoop-common-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 632FF18E23 for ; Mon, 6 Jul 2015 10:05:07 +0000 (UTC) Received: (qmail 82872 invoked by uid 500); 6 Jul 2015 10:05:07 -0000 Delivered-To: apmail-hadoop-common-issues-archive@hadoop.apache.org Received: (qmail 82831 invoked by uid 500); 6 Jul 2015 10:05:07 -0000 Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: common-issues@hadoop.apache.org Delivered-To: mailing list common-issues@hadoop.apache.org Received: (qmail 82807 invoked by uid 99); 6 Jul 2015 10:05:06 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 06 Jul 2015 10:05:06 +0000 Date: Mon, 6 Jul 2015 10:05:06 +0000 (UTC) From: "Vinayakumar B (JIRA)" To: common-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HADOOP-12186) ActiveStandbyElector shouldn't call monitorLockNodeAsync before StatCallback for previous zkClient.exists is received. MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HADOOP-12186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614811#comment-14614811 ] Vinayakumar B commented on HADOOP-12186: ---------------------------------------- Patch looks good to me. +1. Will commit soon. > ActiveStandbyElector shouldn't call monitorLockNodeAsync before StatCallback for previous zkClient.exists is received. > ---------------------------------------------------------------------------------------------------------------------- > > Key: HADOOP-12186 > URL: https://issues.apache.org/jira/browse/HADOOP-12186 > Project: Hadoop Common > Issue Type: Bug > Components: ha > Affects Versions: 2.7.1 > Reporter: zhihai xu > Assignee: zhihai xu > Attachments: HADOOP-12186.000.patch > > > ActiveStandbyElector shouldn't call {{monitorLockNodeAsync}} before StatCallback for previous {{zkClient.exists}} is received. > We saw RM shutdown because ActiveStandbyElector retrying monitorLockNodeAsync exceeded limit. The following is the logs. > Based on the log, it looks like multiple {{monitorLockNodeAsync}} are called at the same time due to back-to-back SyncConnected event received. > The current code doesn't prevent {{zkClient.exists}} from being called before AsyncCallback.StatCallback for previous {{zkClient.exists}} is received. > So the retry for {{monitorLockNodeAsync}} doesn't work correctly sometimes. > {code} > 2015-07-01 19:24:12,806 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6674ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect > 2015-07-01 19:24:12,919 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode... > 2015-07-01 19:24:14,704 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-1.internal/192.168.123.3:2181. Will not attempt to authenticate using SASL (unknown error) > 2015-07-01 19:24:14,704 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.123.3:43487, server: node-1.internal/192.168.123.3:2181 > 2015-07-01 19:24:14,707 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-1.internal/192.168.123.3:2181, sessionid = 0x14e47693cc20007, negotiated timeout = 10000 > 2015-07-01 19:24:14,712 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected. > 2015-07-01 19:24:21,374 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6667ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect > 2015-07-01 19:24:21,477 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode... > 2015-07-01 19:24:22,640 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-1.internal/192.168.123.3:2181. Will not attempt to authenticate using SASL (unknown error) > 2015-07-01 19:24:22,640 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.123.3:43526, server: node-1.internal/192.168.123.3:2181 > 2015-07-01 19:24:22,641 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-1.internal/192.168.123.3:2181, sessionid = 0x14e47693cc20007, negotiated timeout = 10000 > 2015-07-01 19:24:22,642 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected. > 2015-07-01 19:24:29,310 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6669ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect > 2015-07-01 19:24:29,413 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode... > 2015-07-01 19:24:30,738 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-1.internal/192.168.123.3:2181. Will not attempt to authenticate using SASL (unknown error) > 2015-07-01 19:24:30,739 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.123.3:43574, server: node-1.internal/192.168.123.3:2181 > 2015-07-01 19:24:30,739 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-1.internal/192.168.123.3:2181, sessionid = 0x14e47693cc20007, negotiated timeout = 10000 > 2015-07-01 19:24:30,740 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected. > 2015-07-01 19:24:37,409 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6670ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect > 2015-07-01 19:24:37,512 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode... > 2015-07-01 19:24:38,979 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-1.internal/192.168.123.3:2181. Will not attempt to authenticate using SASL (unknown error) > 2015-07-01 19:24:38,979 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.123.3:43598, server: node-1.internal/192.168.123.3:2181 > 2015-07-01 19:24:38,980 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-1.internal/192.168.123.3:2181, sessionid = 0x14e47693cc20007, negotiated timeout = 10000 > 2015-07-01 19:24:38,981 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected. > 2015-07-01 19:24:45,649 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6669ms for sessionid 0x14e47693cc20007, closing socket connection and attempting reconnect > 2015-07-01 19:24:45,752 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. > 2015-07-01 19:24:45,855 INFO org.apache.zookeeper.ZooKeeper: Session: 0x14e47693cc20007 closed > 2015-07-01 19:25:07,932 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x14e47693cc20007 > 2015-07-01 19:25:07,932 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type EMBEDDED_ELECTOR_FAILED. Cause: > Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. > 2015-07-01 19:25:07,932 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x14e47693cc20007 > 2015-07-01 19:25:07,932 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x14e47693cc20007 > 2015-07-01 19:25:07,933 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x14e47693cc20007 > 2015-07-01 19:25:07,933 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down > 2015-07-01 19:25:08,036 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)