hadoop-yarn-issues mailing list archives

From "Chris Riccomini (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
Date Fri, 03 May 2013 17:56:17 GMT

    [ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648624#comment-13648624 ]

Chris Riccomini commented on YARN-614:
--------------------------------------

Looking into #1 a bit more.

The AM's finished container is added to justFinishedContainers in RMAppAttemptImpl.AMFinishingContainerFinishedTransition:

{code}
appAttempt.justFinishedContainers.add(containerStatus);
{code}

That transition is registered in the RMAppAttemptImpl state machine as follows:

{code}
      .addTransition(RMAppAttemptState.FINISHING,
          EnumSet.of(RMAppAttemptState.FINISHING, RMAppAttemptState.FINISHED),
          RMAppAttemptEventType.CONTAINER_FINISHED,
          new AMFinishingContainerFinishedTransition())
{code}
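
To illustrate what the EnumSet of target states means here, the following is a minimal, self-contained sketch of a transition whose hook chooses between FINISHING and FINISHED. It is not the YARN state machine: MultiArcSketch, MultiArcHook, and apply are hypothetical names, and the rule used to pick the target state (whether the finished container is the AM's own) is an assumption made purely for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified model of a transition with several legal target states.
// It mirrors the shape of the registration above, not the real YARN classes.
public class MultiArcSketch {

  enum State { FINISHING, FINISHED }

  // Hypothetical hook: inspects the event and returns the state to move to.
  interface MultiArcHook {
    State transition(boolean isAmContainer);
  }

  // Runs the hook and verifies that its result is one of the declared target states.
  static State apply(Set<State> allowedTargets, MultiArcHook hook, boolean isAmContainer) {
    State next = hook.transition(isAmContainer);
    if (!allowedTargets.contains(next)) {
      throw new IllegalStateException("Invalid target state: " + next);
    }
    return next;
  }

  public static void main(String[] args) {
    Set<State> allowed = EnumSet.of(State.FINISHING, State.FINISHED);
    // Assumption for illustration only: the AM's own container ends the attempt.
    MultiArcHook hook = isAm -> isAm ? State.FINISHED : State.FINISHING;

    System.out.println(apply(allowed, hook, false)); // prints FINISHING
    System.out.println(apply(allowed, hook, true));  // prints FINISHED
  }
}
```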

RMAppAttemptEventType.CONTAINER_FINISHED is the event type set by the RMAppAttemptContainerFinishedEvent constructor:

{code}
  public RMAppAttemptContainerFinishedEvent(ApplicationAttemptId appAttemptId, 
      ContainerStatus containerStatus) {
    super(appAttemptId, RMAppAttemptEventType.CONTAINER_FINISHED);
    this.containerStatus = containerStatus;
  }
{code}

That event is raised by two transitions in RMContainerImpl: ContainerFinishedAtAcquiredState
and KillTransition. During failure scenarios, only KillTransition is triggered. KillTransition
is in turn triggered by these event types:

{code}
RMContainerEventType.RELEASED
RMContainerEventType.EXPIRE
RMContainerEventType.KILL
{code}

From RMContainerEventType:

{code}
  // Source: SchedulerApp
  START,
  ACQUIRED,
  KILL, // Also from Node on NodeRemoval
  RESERVED,

  LAUNCHED,
  FINISHED,

  // Source: ApplicationMasterService->Scheduler
  RELEASED,

  // Source: ContainerAllocationExpirer  
  EXPIRE
{code}

When a node is lost, the scheduler triggers the KILL signal (see removeNode in FairScheduler,
FifoScheduler, and CapacityScheduler).

So it looks like KILL is triggered by NodeRemoval, which happens when a node fails. I believe
this means that the AM's container will be added to justFinishedContainers when a node is
lost.
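
The chain traced above (node removal, then KILL, then KillTransition, then CONTAINER_FINISHED, then justFinishedContainers) can be sketched as a small standalone model. This is a simplified illustration only: NodeLossSketch, AppAttempt, Container, and the method names are hypothetical stand-ins, not the real ResourceManager classes, and real YARN dispatches these events asynchronously through its state machines rather than by direct method calls.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

// Hypothetical, simplified stand-ins for the classes discussed above.
public class NodeLossSketch {

  // Subset of the RMContainerEventType values quoted in the comment.
  enum ContainerEvent { LAUNCHED, RELEASED, EXPIRE, KILL }

  // The three event types that the comment says lead to KillTransition.
  static final EnumSet<ContainerEvent> KILL_TRANSITION_EVENTS =
      EnumSet.of(ContainerEvent.RELEASED, ContainerEvent.EXPIRE, ContainerEvent.KILL);

  // Stand-in for RMAppAttemptImpl: records containers reported as finished.
  static class AppAttempt {
    final List<String> justFinishedContainers = new ArrayList<>();

    // Stand-in for handling RMAppAttemptEventType.CONTAINER_FINISHED.
    void onContainerFinished(String containerStatus) {
      justFinishedContainers.add(containerStatus);
    }
  }

  // Stand-in for RMContainerImpl: forwards kill-type events to its attempt.
  static class Container {
    final String id;
    final AppAttempt attempt;

    Container(String id, AppAttempt attempt) {
      this.id = id;
      this.attempt = attempt;
    }

    void handle(ContainerEvent event) {
      if (KILL_TRANSITION_EVENTS.contains(event)) {
        // Models KillTransition raising a container-finished event.
        attempt.onContainerFinished(id + ":" + event);
      }
    }
  }

  // Stand-in for a scheduler's removeNode: kill every container on the lost node.
  static void removeNode(List<Container> containersOnNode) {
    for (Container container : containersOnNode) {
      container.handle(ContainerEvent.KILL);
    }
  }

  public static void main(String[] args) {
    AppAttempt attempt = new AppAttempt();
    List<Container> lostNode = new ArrayList<>();
    lostNode.add(new Container("am_container", attempt));

    removeNode(lostNode);
    System.out.println(attempt.justFinishedContainers); // prints [am_container:KILL]
  }
}
```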

                
> Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-614
>                 URL: https://issues.apache.org/jira/browse/YARN-614
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>         Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be retried
> unnecessarily. The only reason YARN should retry an attempt is when the hardware fails or
> YARN has an error. NM failing, lost NM and NM disk errors are the hardware errors that come
> to mind.

