hadoop-yarn-issues mailing list archives

From "Anubhav Dhoot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
Date Tue, 03 Mar 2015 14:52:06 GMT

    [ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345161#comment-14345161 ]

Anubhav Dhoot commented on YARN-3242:
-------------------------------------

[~zxu] The patch looks good overall.
Instead of blindly switching in zkClient on a connect and removing it on a disconnect, we now
verify that activeZkClient is the one receiving the event.
It makes sense, then, to get rid of the oldZkClient logic and keep a single ZK client, activeZkClient,
that can receive events and that, on a connection event, is activated for use as zkClient to do
the actual processing.
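
To make the check concrete, here is a minimal sketch of the guard described above. It is not the patch itself: only the names zkClient and activeZkClient and the comparison `zk != activeZkClient` come from this thread. The `Client`, `State` and `Store` types are hypothetical stand-ins, so the example runs without ZooKeeper on the classpath.

```java
class StaleEventGuardSketch {
    /** Hypothetical stand-in for the ZooKeeper connection states discussed in this thread. */
    enum State { SYNC_CONNECTED, DISCONNECTED }

    /** Hypothetical stand-in for a ZooKeeper client handle; only object identity matters here. */
    static final class Client {
        final String name;
        Client(String name) { this.name = name; }
    }

    /** Hypothetical, heavily reduced model of the state store's event handling. */
    static final class Store {
        Client activeZkClient; // the only client whose events are accepted
        Client zkClient;       // the client used for actual processing; null while disconnected

        synchronized void processWatchEvent(Client zk, State state) {
            if (zk != activeZkClient) {
                // Event from an old, already replaced session: ignore it.
                return;
            }
            switch (state) {
                case SYNC_CONNECTED:
                    zkClient = activeZkClient; // activate the client for processing
                    break;
                case DISCONNECTED:
                    zkClient = null;
                    break;
                default:
                    break;
            }
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        Client oldClient = new Client("old session");
        Client newClient = new Client("new session");

        store.activeZkClient = newClient;
        store.processWatchEvent(newClient, State.SYNC_CONNECTED);
        // A late Disconnected event from the old session must not reset zkClient.
        store.processWatchEvent(oldClient, State.DISCONNECTED);
        System.out.println("zkClient still set: " + (store.zkClient == newClient));
    }
}
```

Without the identity check at the top of processWatchEvent, the late Disconnected event would set zkClient to null, which is the failure the issue describes.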

I verified that the updated unit test fails if I remove the check  if (zk != activeZkClient) {

Only minor nits:
a) Could we add comments saying that activeZkClient is not used for the actual processing (that's
still zkClient) but only to process watched events, and that on a connection event it gets activated
into zkClient?
b) Will CountdownWatcher#setWatchedClient ever be called more than once? If not, rename it to
initializeWatchedClient and let it throw if the client is already non-null.
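
A rough sketch of what nit (b) could look like follows. The class name CountdownWatcherSketch, the Object-typed field and the getter are assumptions made for illustration only; the real CountdownWatcher is not shown in this thread.

```java
class WatchedClientInitSketch {
    /** Hypothetical reduction of a watcher that is bound to exactly one client. */
    static final class CountdownWatcherSketch {
        private Object watchedClient; // assumed field; the real type would be the ZooKeeper client

        /** One-shot initializer: a second call is treated as a programming error. */
        synchronized void initializeWatchedClient(Object client) {
            if (client == null) {
                throw new IllegalArgumentException("client must not be null");
            }
            if (watchedClient != null) {
                throw new IllegalStateException("watched client is already initialized");
            }
            watchedClient = client;
        }

        synchronized Object getWatchedClient() {
            return watchedClient;
        }
    }

    public static void main(String[] args) {
        CountdownWatcherSketch watcher = new CountdownWatcherSketch();
        watcher.initializeWatchedClient(new Object());
        try {
            watcher.initializeWatchedClient(new Object());
        } catch (IllegalStateException e) {
            System.out.println("second initialization rejected: " + e.getMessage());
        }
    }
}
```

Throwing on a second call makes the single-assignment assumption explicit instead of silently rebinding the watcher to a different client.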

LGTM otherwise

> Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK
client session due to ZooKeeper asynchronously closing client session.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3242
>                 URL: https://issues.apache.org/jira/browse/YARN-3242
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch, YARN-3242.003.patch
>
>
> A watcher event from an old ZK client session can corrupt the state of a new ZK client session, because ZooKeeper closes client sessions asynchronously.
> The watcher event from the old ZK client session can still be delivered to ZKRMStateStore after the old ZK client session is closed.
> This causes a serious problem: ZKRMStateStore goes out of sync with the ZooKeeper session.
> We only have one ZKRMStateStore, but we can have multiple ZK client sessions.
> Currently ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that has just been closed will still be processed.
> For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
> {code}
>         case Disconnected:
>           LOG.info("ZKRMStateStore Session disconnected");
>           oldZkClient = zkClient;
>           zkClient = null;
>           break;
> {code}
> Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again.
> As a result, all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM shuts down.
> The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, EventThread still processes all events until the waitingEvents queue is empty.
> {code}
>               while (true) {
>                  Object event = waitingEvents.take();
>                  if (event == eventOfDeath) {
>                     wasKilled = true;
>                  } else {
>                     processEvent(event);
>                  }
>                  if (wasKilled)
>                     synchronized (waitingEvents) {
>                        if (waitingEvents.isEmpty()) {
>                           isRunning = false;
>                           break;
>                        }
>                     }
>               }
>       private void processEvent(Object event) {
>           try {
>               if (event instanceof WatcherSetEventPair) {
>                   // each watcher will process the event
>                   WatcherSetEventPair pair = (WatcherSetEventPair) event;
>                   for (Watcher watcher : pair.watchers) {
>                       try {
>                           watcher.process(pair.event);
>                       } catch (Throwable t) {
>                           LOG.error("Error while calling watcher ", t);
>                       }
>                   }
>               } else {
>     public void disconnect() {
>         if (LOG.isDebugEnabled()) {
>             LOG.debug("Disconnecting client for session: 0x"
>                       + Long.toHexString(getSessionId()));
>         }
>         sendThread.close();
>         eventThread.queueEventOfDeath();
>     }
>     public void close() throws IOException {
>         if (LOG.isDebugEnabled()) {
>             LOG.debug("Closing client for session: 0x"
>                       + Long.toHexString(getSessionId()));
>         }
>         try {
>             RequestHeader h = new RequestHeader();
>             h.setType(ZooDefs.OpCode.closeSession);
>             submitRequest(h, null, null, null);
>         } catch (InterruptedException e) {
>             // ignore, close the send/event threads
>         } finally {
>             disconnect();
>         }
>     }
> {code}
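
The draining behaviour of the quoted EventThread loop can be reproduced in isolation. The sketch below is a single-threaded simplification under stated assumptions: it copies only the loop structure from the excerpt above, and the class name, the `drain` helper and the string events are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

class EventDrainSketch {
    /** Sentinel playing the role of eventOfDeath in the quoted loop. */
    static final Object EVENT_OF_DEATH = new Object();

    /** Runs the quoted loop shape and returns the events that were still processed. */
    static List<Object> drain(LinkedBlockingQueue<Object> waitingEvents) {
        List<Object> processed = new ArrayList<>();
        boolean wasKilled = false;
        try {
            while (true) {
                Object event = waitingEvents.take();
                if (event == EVENT_OF_DEATH) {
                    wasKilled = true;
                } else {
                    processed.add(event); // stands in for processEvent(event)
                }
                // The loop only stops once it was killed AND the queue is empty.
                if (wasKilled && waitingEvents.isEmpty()) {
                    break;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and stop draining
        }
        return processed;
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<Object> queue = new LinkedBlockingQueue<>();
        queue.add("event queued before close");
        queue.add(EVENT_OF_DEATH);
        queue.add("event queued after close");
        System.out.println(drain(queue));
    }
}
```

Events that are still in the queue when the sentinel is taken are delivered anyway, which is why a watcher can see events from a session that has already been closed.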



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
