hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4254) ApplicationAttempt stuck for ever due to UnknowHostexception
Date Mon, 12 Oct 2015 12:51:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953076#comment-14953076 ]

Jason Lowe commented on YARN-4254:

Thanks for the report and patch, Bibin!

The patch seems to be trying to fix a very specific failure mode, but in practice it will
lead to a lot of AM attempt failures, which isn't ideal.  Would it make more sense if the RM
simply refused to accept nodemanagers into the cluster that are unresolvable?  Also, the fact
that we retry forever seems broken to me.  We should be giving up at some point and failing
the attempt, whether that be due to unknown host exceptions or other persistent errors.  Checking
specifically for unknown host exceptions makes me think we'll just hit this type of problem
again, but for some other persistent error.
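
To illustrate the "give up at some point" idea, here is a minimal sketch of a bounded
retry that treats every persistent error the same way rather than special-casing unknown
host exceptions.  It is not the YARN launcher code: BoundedRetry, RetriesExhaustedException,
maxAttempts and sleepMillis are hypothetical names chosen for this example.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of YARN: retries an action a bounded number of times.
final class BoundedRetry {

    /** Thrown once every attempt has failed; the cause is the last underlying error. */
    static final class RetriesExhaustedException extends Exception {
        RetriesExhaustedException(int attempts, Exception cause) {
            super("gave up after " + attempts + " attempts", cause);
        }
    }

    /**
     * Runs the action up to maxAttempts times, sleeping sleepMillis between tries.
     * Any exception type (UnknownHostException or otherwise) counts as a failed try.
     */
    static <T> T run(Callable<T> action, int maxAttempts, long sleepMillis)
            throws RetriesExhaustedException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                throw e; // do not swallow interruption
            } catch (Exception e) {
                last = e; // remember the error, whatever its type
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        // Budget exhausted: surface the failure so the caller can fail the attempt.
        throw new RetriesExhaustedException(maxAttempts, last);
    }
}
```

A caller would catch RetriesExhaustedException and fail the application attempt at that
point, instead of looping indefinitely.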

> ApplicationAttempt stuck for ever due to UnknowHostexception
> ------------------------------------------------------------
>                 Key: YARN-4254
>                 URL: https://issues.apache.org/jira/browse/YARN-4254
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: 0001-YARN-4254.patch
> Scenario
> ========
> 1. RM HA and 5 NMs are available in the cluster and working fine.
> 2. Add one more NM to the same cluster, but do not update /etc/hosts on the RM.
> 3. Submit an application to the same cluster.
> If the AM gets allocated to the newly added NM, the *application attempt will get stuck
> forever*. The user will not learn why this happened.
> Impact
> 1. RM logs get overloaded with exceptions.
> 2. The application gets stuck forever.
> Handling suggestion: YARN-261 allows failing an application attempt.
> If we fail the attempt, the next attempt could be assigned to another NM.

This message was sent by Atlassian JIRA
