hadoop-yarn-issues mailing list archives

From "Bibin A Chundatt (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6647) RM can crash during shutdown due to InterruptedException
Date Mon, 20 Nov 2017 09:32:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259007#comment-16259007 ]

Bibin A Chundatt commented on YARN-6647:
----------------------------------------

[~jlowe]
Adding the analysis done as part of YARN-7515 to this JIRA.
{quote}
 and the interrupt exception ended up bubbling all the way up to the dispatcher which caused
the JVM exit
{quote}
IIUC it is not the {{InterruptedException}} from the interrupted ZK operation bubbling up
that causes the issue. The ZK operation interrupt leads to an *RMFatalEvent* being sent to
{{AsyncDispatcher#EventHandler}} from the *interrupted thread*, i.e.
{{AbstractDelegationTokenSecretManager#ExpiredTokenRemover}}, and it is that dispatch which
fails. Please correct me if I am wrong.

*Analysis*

{code}
      try {
        eventQueue.put(event);
      } catch (InterruptedException e) {
        if (!stopped) {
          LOG.warn(
              "AsyncDispatcher thread interrupted " + Thread.currentThread()
                  .getName(), e);
        }
        // Need to reset drained flag to true if event queue is empty,
        // otherwise dispatcher will hang on stop.
        drained = eventQueue.isEmpty();
        throw new YarnRuntimeException(e);
      }
{code}
A put to {{LinkedBlockingQueue}} from an interrupted thread fails immediately, because {{put()}}
acquires its lock interruptibly:
{code}
public void put(E e) throws InterruptedException {
..
     putLock.lockInterruptibly();
}
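{code}
{{lockInterruptibly()}} delegates to {{AbstractQueuedSynchronizer#acquireInterruptibly}}, which
checks the interrupt flag before doing anything else:
{code}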
    public final void acquireInterruptibly(int arg)
            throws InterruptedException {
        if (Thread.interrupted())
            throw new InterruptedException();
        ..
    }
{code}
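The two JDK fragments above imply that {{put()}} throws even when the queue has free capacity, as long as the calling thread's interrupt flag is already set. A minimal standalone sketch of that behaviour follows; the class and method names are made up for illustration and are not part of Hadoop. In the dispatcher the same exception is then wrapped in a {{YarnRuntimeException}}.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class, not Hadoop code.
class InterruptedPutDemo {

    /**
     * Returns true if put() threw InterruptedException and left the queue
     * empty when called from a thread whose interrupt flag was already set.
     */
    static boolean putFailsWhenInterrupted() {
        // Unbounded queue: put() never has to wait for free space.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Simulate a caller that was interrupted earlier (as ExpiredTokenRemover is).
        Thread.currentThread().interrupt();
        try {
            queue.put("event");
            return false; // not expected: put() succeeded
        } catch (InterruptedException e) {
            // The event was never enqueued.
            return queue.isEmpty();
        } finally {
            // Clear the flag so the calling code is not affected by this demo.
            Thread.interrupted();
        }
    }

    public static void main(String[] args) {
        System.out.println("put failed on interrupted thread: " + putFailsWhenInterrupted());
    }
}
```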

*RM switchover flow which could shut down the RM*

ResourceManager {{transitionToStandby()}} --> {{RMActiveService.stop()}} --> {{RMSecretManagerService#serviceStop()}}
--> {{rmDTSecretManager.stopThreads()}}
{code}
      synchronized (noInterruptsLock) {
        tokenRemoverThread.interrupt();
      }
{code}
{{ExpiredTokenRemover}}, when interrupted during {{rollMasterKey()}}, throws {{InterruptedException}},
which causes {{notifyStoreOperationFailedInternal}} to be called in
{{RMStateStore#StoreRMDTMasterKeyTransition}}:
{code}
      try {
        LOG.info("Storing RMDTMasterKey.");
        store.storeRMDTMasterKeyState(dtEvent.getDelegationKey());
      } catch (Exception e) {
        LOG.error("Error While Storing RMDTMasterKey.", e);
        isFenced = store.notifyStoreOperationFailedInternal(e);
      }
{code}
{{store.notifyStoreOperationFailedInternal}} eventually fires an {{RMFatalEvent}} from the
{{ExpiredTokenRemover}} thread, which is still *interrupted*:
{code}
    rmDispatcher.getEventHandler().handle(
          new RMFatalEvent(RMFatalEventType.STATE_STORE_FENCED,
              failureCause));
{code}
This eventually causes {{LinkedBlockingQueue#put}} to fail, and the *RM exits*.

*Solution:* We should skip {{notifyStoreOperationFailedInternal}} if the current thread is
interrupted, which should avoid this case. Thoughts?
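
A rough sketch of what such a guard could look like, assuming the check is made right before the fatal event is raised. {{InterruptAwareNotifier}}, {{notifyFailure}} and the {{fatalEvents}} list are hypothetical stand-ins for the state store and dispatcher, so this only illustrates the idea and is not the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the state store's failure notification path.
class InterruptAwareNotifier {

    // Stand-in for events handed to the dispatcher; a real RM would enqueue an RMFatalEvent.
    final List<String> fatalEvents = new ArrayList<>();

    /**
     * Returns true if a fatal event was raised, false if it was skipped
     * because the calling thread is already interrupted.
     */
    boolean notifyFailure(Exception failureCause) {
        // isInterrupted() only reads the flag; it does not clear it.
        if (Thread.currentThread().isInterrupted()) {
            // The thread is being stopped (e.g. during transition to standby),
            // so the failure is treated as a side effect of shutdown and not escalated.
            return false;
        }
        fatalEvents.add("STATE_STORE_FENCED: " + failureCause.getMessage());
        return true;
    }
}
```

With this shape, a failure reported from an interrupted {{ExpiredTokenRemover}}-like thread is dropped, while failures from healthy threads are still escalated.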

*The issue exists only in 3.0.0-alpha+*, since the Curator version was changed to {{2.12.0}}:

{code}
    public static <T> T callWithRetry(CuratorZookeeperClient client, Callable<T> proc) throws Exception
    {
        T               result = null;
        RetryLoop       retryLoop = client.newRetryLoop();
        while ( retryLoop.shouldContinue() )
        {
            try
            {
                ..
            }
            catch ( Exception e )
            {
                ThreadUtils.checkInterrupted(e);  // <-- the relevant call
                retryLoop.takeException(e);
            }
        }
        return result;
    }
{code}
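As I understand it, {{ThreadUtils.checkInterrupted}} re-asserts the interrupt flag when the caught exception is an {{InterruptedException}}, which is why the thread is still marked interrupted when it later tries to dispatch the event. The sketch below uses a local approximation of that helper (an assumption about its behaviour, not Curator's source) to show the effect on the flag.

```java
import java.util.concurrent.Callable;

// Hypothetical demo class; checkInterrupted here is a local approximation,
// not the Curator implementation.
class RetryInterruptSketch {

    // Re-set the interrupt flag if the failure was an interruption.
    static void checkInterrupted(Throwable e) {
        if (e instanceof InterruptedException) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Runs proc once, handling failures the way the retry loop's catch block does,
     * and reports whether the interrupt flag was set afterwards.
     * The flag is cleared before returning so callers are unaffected.
     */
    static boolean flagSetAfter(Callable<Void> proc) {
        try {
            proc.call();
        } catch (Exception e) {
            checkInterrupted(e);
        }
        return Thread.interrupted(); // reads and clears the flag
    }

    public static void main(String[] args) {
        // Interrupted operation: the flag is set again after the catch block.
        System.out.println(flagSetAfter(() -> { throw new InterruptedException("zk op interrupted"); }));
        // Any other failure: the flag stays clear.
        System.out.println(flagSetAfter(() -> { throw new IllegalStateException("other failure"); }));
    }
}
```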

Related JIRA: HADOOP-14187

> RM can crash during shutdown due to InterruptedException
> --------------------------------------------------------
>
>                 Key: YARN-6647
>                 URL: https://issues.apache.org/jira/browse/YARN-6647
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Jason Lowe
>
> Noticed some tests were failing due to the JVM shutting down early.  I was able to reproduce
> this occasionally with TestKillApplicationWithRMHA.  Stacktrace to follow.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

