Date: Wed, 4 Mar 2015 19:29:39 +0000 (UTC)
From: "zhihai xu (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.

     [ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-3242:
----------------------------
    Attachment: YARN-3242.004.patch

> Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3242
>                 URL: https://issues.apache.org/jira/browse/YARN-3242
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch, YARN-3242.003.patch, YARN-3242.004.patch
>
>
> A watcher event from an old ZK client session can corrupt the state of the new ZK client session, because ZooKeeper closes client sessions asynchronously.
> A watcher event from the old ZK client session can still be delivered to ZKRMStateStore after the old session has been closed.
> This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session.
> There is only one ZKRMStateStore, but there can be multiple ZK client sessions.
> Currently ZKRMStateStore#processWatchEvent doesn't check whether a watcher event comes from the current session, so a watcher event from an old ZK client session that has just been closed is still processed.
> For example, if a Disconnected event is received from the old session after the new session is connected, zkClient is set to null:
> {code}
>         case Disconnected:
>           LOG.info("ZKRMStateStore Session disconnected");
>           oldZkClient = zkClient;
>           zkClient = null;
>           break;
> {code}
> ZKRMStateStore then never receives a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again.
> From then on, all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM shuts down.
> The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, EventThread still processes all events until the waitingEvents queue is empty.
> {code}
>     while (true) {
>         Object event = waitingEvents.take();
>         if (event == eventOfDeath) {
>             wasKilled = true;
>         } else {
>             processEvent(event);
>         }
>         if (wasKilled)
>             synchronized (waitingEvents) {
>                 if (waitingEvents.isEmpty()) {
>                     isRunning = false;
>                     break;
>                 }
>             }
>     }
>
>     private void processEvent(Object event) {
>         try {
>             if (event instanceof WatcherSetEventPair) {
>                 // each watcher will process the event
>                 WatcherSetEventPair pair = (WatcherSetEventPair) event;
>                 for (Watcher watcher : pair.watchers) {
>                     try {
>                         watcher.process(pair.event);
>                     } catch (Throwable t) {
>                         LOG.error("Error while calling watcher ", t);
>                     }
>                 }
>             } else {
>     // ...
>
>     public void disconnect() {
>         if (LOG.isDebugEnabled()) {
>             LOG.debug("Disconnecting client for session: 0x"
>                       + Long.toHexString(getSessionId()));
>         }
>         sendThread.close();
>         eventThread.queueEventOfDeath();
>     }
>
>     public void close() throws IOException {
>         if (LOG.isDebugEnabled()) {
>             LOG.debug("Closing client for session: 0x"
>                       + Long.toHexString(getSessionId()));
>         }
>         try {
>             RequestHeader h = new RequestHeader();
>             h.setType(ZooDefs.OpCode.closeSession);
>             submitRequest(h, null, null, null);
>         } catch (InterruptedException e) {
>             // ignore, close the send/event threads
>         } finally {
>             disconnect();
>         }
>     }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
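
To make the failure mode in the description concrete, here is a minimal, self-contained sketch of the scenario. It assumes nothing about the attached patches: `Session`, `Event`, `Store`, and `StaleWatcherEventSketch` are hypothetical stand-ins, not ZooKeeper or YARN classes. The `drain` method copies the drain-until-empty shape of the EventThread loop quoted in the description, and `Store.processWatchEvent` adds the session check that the description says is currently missing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class StaleWatcherEventSketch {
    enum State { SYNC_CONNECTED, DISCONNECTED }

    /** Hypothetical stand-in for one ZK client session. */
    static final class Session {
        final String name;
        Session(String name) { this.name = name; }
    }

    /** A queued watcher event that remembers which session produced it. */
    static final class Event {
        final Session source;
        final State state;
        Event(Session source, State state) { this.source = source; this.state = state; }
    }

    /** Hypothetical store; `client` plays the role of zkClient in the description. */
    static final class Store {
        Session active;  // session the store most recently created
        Session client;  // null while the active session is disconnected
        final List<String> log = new ArrayList<>();

        synchronized void processWatchEvent(Event e) {
            if (e.source != active) {
                // Assumed guard: drop events that belong to a session we no longer use.
                log.add("ignored " + e.state + " from " + e.source.name);
                return;
            }
            client = (e.state == State.DISCONNECTED) ? null : active;
            log.add("handled " + e.state + " from " + e.source.name);
        }
    }

    private static final Object EVENT_OF_DEATH = new Object();

    /** Keeps delivering events after the death marker until the queue is empty. */
    static void drain(BlockingQueue<Object> waitingEvents, Store store) throws InterruptedException {
        boolean wasKilled = false;
        while (true) {
            Object event = waitingEvents.take();
            if (event == EVENT_OF_DEATH) {
                wasKilled = true;
            } else {
                store.processWatchEvent((Event) event);
            }
            if (wasKilled && waitingEvents.isEmpty()) {
                break;
            }
        }
    }

    static Store simulate() {
        Session oldSession = new Session("old");
        Session newSession = new Session("new");
        Store store = new Store();

        // The old session is closed: the death marker is queued first,
        // then a late Disconnected event is queued behind it.
        BlockingQueue<Object> oldQueue = new LinkedBlockingQueue<>();
        oldQueue.add(EVENT_OF_DEATH);
        oldQueue.add(new Event(oldSession, State.DISCONNECTED));

        // Meanwhile the store has already switched to the new, connected session.
        store.active = newSession;
        store.client = newSession;

        try {
            drain(oldQueue, store);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }
        return store;
    }

    public static void main(String[] args) {
        Store store = simulate();
        System.out.println(store.log);                                      // [ignored DISCONNECTED from old]
        System.out.println("client still set: " + (store.client != null)); // client still set: true
    }
}
```

Without the `e.source != active` check, the late Disconnected event from the old session would set `client` to null even though the new session is connected, which is the out-of-sync state described in this issue.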