hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4254) ApplicationAttempt stuck for ever due to UnknowHostexception
Date Tue, 13 Oct 2015 15:35:06 GMT

    [ https://issues.apache.org/jira/browse/YARN-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955134#comment-14955134 ]

Sunil G commented on YARN-4254:

Hi [~bibinchundatt]
Thank you for sharing the details. As you mentioned, the AM attempt is in the scheduled state,
but the container has not been launched yet.
The container for this app attempt (the AM container) was already allocated during the NM
heartbeat, but it has not yet been pulled from RMAppAttempt AMContainerAllocatedTransition.
Based on our offline discussion, the pull must be failing on the DNS or /etc/hosts lookup,
which causes the attempt to keep retrying in a loop, as you mentioned. I agree with your
point of view that this needs to be handled.

In some cases DNS may be unavailable for a while, so we must retry pulling such containers.
FiCaSchedulerApp currently does this. However, in cases like this JIRA the retry causes a
permanent hang for the application: the container has been allocated by the RM but can never
be pulled, because the host lookup keeps failing.

So if we validate the host during register/heartbeat, we must also ensure that such containers
are removed from the newly allocated list. Alternatively, we could handle the exception thrown
while creating the container token and then remove the container from the
{{newlyAllocatedContainers}} list. Thoughts?
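
To make the second option concrete, here is a minimal, self-contained sketch. It is not taken
from the YARN code base or from the attached patch: the Container, TokenFactory and PullResult
types, the pull method and the maxLookupFailures limit are all hypothetical stand-ins. It
assumes we still want a few retries for a short DNS outage, and drops the container from the
list only after the lookup has failed a bounded number of times.

```java
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustration only: every type and method here is hypothetical, not a YARN API.
class PullContainersSketch {

    /** Stand-in for an allocated container waiting to be pulled. */
    static final class Container {
        final String id;
        final String nodeHost;
        int failedLookups = 0;

        Container(String id, String nodeHost) {
            this.id = id;
            this.nodeHost = nodeHost;
        }
    }

    /** Stand-in for the token-creation step, which needs to resolve the NM host. */
    interface TokenFactory {
        String createToken(Container c) throws UnknownHostException;
    }

    /** Containers handed out in this pull, and containers given up on. */
    static final class PullResult {
        final List<Container> pulled = new ArrayList<>();
        final List<Container> rejected = new ArrayList<>();
    }

    /**
     * Tries to create a token for each newly allocated container.
     * A success removes the container from the list and returns it as pulled.
     * A lookup failure leaves the container in the list for a later retry,
     * until it has failed maxLookupFailures times; then it is removed and
     * reported as rejected so the caller can release it and fail the attempt.
     */
    static PullResult pull(List<Container> newlyAllocatedContainers,
                           TokenFactory tokens,
                           int maxLookupFailures) {
        PullResult result = new PullResult();
        Iterator<Container> it = newlyAllocatedContainers.iterator();
        while (it.hasNext()) {
            Container c = it.next();
            try {
                tokens.createToken(c);
                it.remove();
                result.pulled.add(c);
            } catch (UnknownHostException e) {
                c.failedLookups++;
                if (c.failedLookups >= maxLookupFailures) {
                    it.remove();            // stop retrying: avoids the permanent hang
                    result.rejected.add(c);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Container> newlyAllocated = new ArrayList<>();
        newlyAllocated.add(new Container("c1", "known-host"));
        newlyAllocated.add(new Container("c2", "unknown-host"));

        // Fake token factory: only "unknown-host" fails to resolve.
        TokenFactory tokens = c -> {
            if (c.nodeHost.equals("unknown-host")) {
                throw new UnknownHostException(c.nodeHost);
            }
            return "token-" + c.id;
        };

        PullResult first = pull(newlyAllocated, tokens, 2);
        PullResult second = pull(newlyAllocated, tokens, 2);
        System.out.println("first pull:  pulled=" + first.pulled.size()
                + " rejected=" + first.rejected.size());
        System.out.println("second pull: pulled=" + second.pulled.size()
                + " rejected=" + second.rejected.size()
                + " remaining=" + newlyAllocated.size());
    }
}
```

With a limit like this, a container on an unresolvable host leaves the list after a bounded
number of tries instead of being retried indefinitely, and the caller learns which containers
were given up on.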

> ApplicationAttempt stuck for ever due to UnknowHostexception
> ------------------------------------------------------------
>                 Key: YARN-4254
>                 URL: https://issues.apache.org/jira/browse/YARN-4254
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: 0001-YARN-4254.patch, Logs.txt, Test.patch
> Scenario
> =======
> 1. RM HA and 5 NMs are available in the cluster and working fine.
> 2. Add one more NM to the same cluster, but the RM's /etc/hosts is not updated.
> 3. Submit an application to the same cluster.
> If the AM gets allocated to the newly added NM, the *application attempt will get stuck
> forever*. The user will not get to know why this happened.
> Impact
> 1. RM logs get overloaded with exceptions.
> 2. The application gets stuck forever.
> Handling suggestion: YARN-261 allows failing an application attempt.
> If we fail the attempt, the next attempt could get assigned to another NM.

This message was sent by Atlassian JIRA
