hadoop-yarn-issues mailing list archives

From "Bibin A Chundatt (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6647) RM can crash during shutdown due to InterruptedException
Date Mon, 20 Nov 2017 09:32:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259007#comment-16259007 ]

Bibin A Chundatt commented on YARN-6647:

Adding the analysis done as part of YARN-7515 to this JIRA.
"... and the interrupt exception ended up bubbling all the way up to the dispatcher, which caused
the JVM exit"
IIUC, it is not the InterruptedException bubbling up from the interrupted ZK operation that
causes the issue. Rather, the ZK operation interrupt leads to an *RMFatalEvent* being sent to
{{AsyncDispatcher#EventHandler}} from the *interrupted thread*, i.e.
{{AbstractDelegationTokenSecretManager#ExpiredTokenRemover}}. Please correct me if I am wrong.


      try {
        eventQueue.put(event);
      } catch (InterruptedException e) {
        if (!stopped) {
          LOG.warn(
              "AsyncDispatcher thread interrupted " + Thread.currentThread()
                  .getName(), e);
        }
        // Need to reset drained flag to true if event queue is empty,
        // otherwise dispatcher will hang on stop.
        drained = eventQueue.isEmpty();
        throw new YarnRuntimeException(e);
      }
The put operation on {{LinkedBlockingQueue}} (which goes through {{AbstractQueuedSynchronizer#acquireInterruptibly}})
throws when it is called from an interrupted thread:
    public void put(E e) throws InterruptedException {
        ...
    }
    public final void acquireInterruptibly(int arg)
            throws InterruptedException {
        if (Thread.interrupted())
            throw new InterruptedException();
        ...
    }
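
The behaviour described above can be reproduced outside the RM. The following is a minimal sketch, not RM code: the class and method names are made up for illustration, and the only library call relied on is {{LinkedBlockingQueue#put}}. It shows that {{put}} throws even though the queue is empty and unbounded, simply because the calling thread's interrupt flag is already set.

```java
import java.util.concurrent.LinkedBlockingQueue;

public class InterruptedPutDemo {
    /**
     * Returns true if put() threw InterruptedException on a thread whose
     * interrupt flag was set beforehand, and nothing was enqueued.
     */
    static boolean putFailsWhenInterrupted() {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        // Simulate a thread that was interrupted before it reached put().
        Thread.currentThread().interrupt();
        try {
            // The queue is empty and unbounded, so put() would never need to block.
            queue.put("event");
            return false;
        } catch (InterruptedException e) {
            // The interruptible lock acquisition saw the flag and threw before
            // enqueuing. Thread.interrupted() has cleared the flag at this point.
            return queue.isEmpty();
        }
    }

    public static void main(String[] args) {
        System.out.println("put failed on interrupted thread: " + putFailsWhenInterrupted());
    }
}
```

In the dispatcher's handler this {{InterruptedException}} is what gets wrapped in a {{YarnRuntimeException}}.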

*RM switchover flow that can shut down the RM*

Resource manager {{transitionToStandby()}} --> {{RMActiveService.stop()}} --> {{RMSecretManagerService#serviceStop()}}
      synchronized (noInterruptsLock) {
        ...
      }
{{ExpiredTokenRemover}}, when interrupted during {{rollMasterKey()}}, throws an {{InterruptedException}},
which triggers {{notifyStoreOperationFailedInternal}} in:
      try {
        LOG.info("Storing RMDTMasterKey.");
        ...
      } catch (Exception e) {
        LOG.error("Error While Storing RMDTMasterKey.", e);
        isFenced = store.notifyStoreOperationFailedInternal(e);
      }
{{store.notifyStoreOperationFailedInternal}} eventually fires an {{RMFatalEvent}} from the {{ExpiredTokenRemover}}
thread, which is *interrupted*:
          new RMFatalEvent(RMFatalEventType.STATE_STORE_FENCED,
This eventually causes {{LinkedBlockingQueue#put}} to fail and the *RM to exit*.

*Solution:* We should skip {{notifyStoreOperationFailedInternal}} if the current thread is
interrupted, which should avoid this case. Thoughts?
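
A rough sketch of what such a guard could look like, for discussion only. The {{Store}} interface, {{handleStoreFailure}} and {{notificationsSent}} are hypothetical names invented for this example and are not the real {{RMStateStore}} API; only the method name {{notifyStoreOperationFailedInternal}} is taken from the code above.

```java
public class InterruptGuardSketch {
    /** Hypothetical stand-in for the state store; NOT the real RMStateStore API. */
    interface Store {
        boolean notifyStoreOperationFailedInternal(Exception e);
    }

    /**
     * Returns the fenced status. Skips the failure notification when the
     * calling thread has been interrupted (for example during shutdown).
     */
    static boolean handleStoreFailure(Store store, Exception e, boolean isFenced) {
        // isInterrupted() only reads the flag; it does not clear it.
        if (Thread.currentThread().isInterrupted()) {
            // Notifying here would enqueue an event from an interrupted thread.
            return isFenced;
        }
        return store.notifyStoreOperationFailedInternal(e);
    }

    /** Simulates one store failure and counts how many notifications were sent. */
    static int notificationsSent(boolean interrupted) {
        int[] calls = {0};
        Store store = e -> { calls[0]++; return false; };
        if (interrupted) {
            Thread.currentThread().interrupt();
        }
        try {
            handleStoreFailure(store, new Exception("simulated store failure"), false);
        } finally {
            Thread.interrupted(); // clear the flag so the demo leaves no side effects
        }
        return calls[0];
    }

    public static void main(String[] args) {
        System.out.println("normal thread:      " + notificationsSent(false));
        System.out.println("interrupted thread: " + notificationsSent(true));
    }
}
```

The sketch uses {{isInterrupted()}} rather than {{Thread.interrupted()}} so that the flag stays set for whatever code handles the shutdown afterwards.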

*The issue exists only in 3.0.0 alpha+*, since the Curator version was changed to {{2.12.0}}:

    public static<T> T      callWithRetry(CuratorZookeeperClient client, Callable<T> proc) throws Exception
    {
        T               result = null;
        RetryLoop       retryLoop = client.newRetryLoop();
        while ( retryLoop.shouldContinue() )
        {
            try { .. }
            catch ( Exception e ) { .. }
        }
        return result;
    }
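
The elided catch block is not shown here, so the following is only an illustration of one way a retry helper can leave the caller's thread interrupted. It is NOT Curator's code: the class, the {{maxAttempts}} parameter and the behaviour of re-asserting the interrupt flag are assumptions made for this sketch. If the retry loop behaves like this, any later interruptible call on the same thread (such as {{LinkedBlockingQueue#put}}) fails immediately.

```java
import java.util.concurrent.Callable;

public class InterruptPreservingRetry {
    /**
     * Illustrative retry helper (NOT Curator's implementation): runs proc up
     * to maxAttempts times. On InterruptedException it re-asserts the
     * interrupt flag (an assumption of this sketch) and gives up.
     */
    static <T> T callWithRetry(Callable<T> proc, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return proc.call();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // flag stays set for the caller
                throw e;
            } catch (Exception e) {
                last = e; // retry on any other failure
            }
        }
        throw last != null ? last : new IllegalStateException("no attempts made");
    }

    /** Returns true if the caller's interrupt flag is still set after the failed call. */
    static boolean flagSurvivesFailure() {
        try {
            callWithRetry(() -> { throw new InterruptedException("simulated ZK interrupt"); }, 3);
        } catch (Exception e) {
            // Expected: the simulated operation was interrupted.
        }
        return Thread.interrupted(); // reads and clears the flag
    }

    public static void main(String[] args) {
        System.out.println("flag still set after failure: " + flagSurvivesFailure());
    }
}
```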

Related JIRA: HADOOP-14187

> RM can crash during shutdown due to InterruptedException
> --------------------------------------------------------
>                 Key: YARN-6647
>                 URL: https://issues.apache.org/jira/browse/YARN-6647
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Jason Lowe
> Noticed some tests were failing due to the JVM shutting down early.  I was able to reproduce
> this occasionally with TestKillApplicationWithRMHA.  Stacktrace to follow.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org
