hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3811) NM restarts could lead to app failures
Date Wed, 17 Jun 2015 18:07:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590207#comment-14590207 ]

Jason Lowe commented on YARN-3811:
----------------------------------

bq. this is not possible to do as the NM needs to report the RPC server port during registration
- so, server start should happen before registration.
Yes, but that's a limitation in the RPC layer.  If we could bind the server before we start
it then we could know the port, register with the RM, then start the server.  IMHO the RPC
layer should support this, but I understand we'll have to work around the lack of that in
the interim.  I think we all can agree the retry exception is just a hack being used because
we can't keep the client service from serving too soon.
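
To illustrate the ordering described above (bind, learn the port, register, then serve), here is a
minimal sketch using plain java.net sockets rather than the Hadoop RPC layer. The class name,
registerWithResourceManager() and registeredPort are hypothetical stand-ins and not part of any
YARN API; the sketch only shows that the port is known before any request is handled. Note that
the OS may still queue incoming connections in the listen backlog once the socket is bound, but
nothing is served until accept() runs.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class BindThenServe {
    // Hypothetical stand-in for NM -> RM registration state; it only records the port.
    static int registeredPort = -1;

    // Hypothetical placeholder for the real registration call.
    static void registerWithResourceManager(int port) {
        registeredPort = port;
    }

    static char run() throws Exception {
        try (ServerSocket server = new ServerSocket()) {
            // Step 1: bind to an ephemeral port. No accept() yet, so no request is served.
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            int port = server.getLocalPort();

            // Step 2: the port is now known, so registration can happen first.
            registerWithResourceManager(port);

            // Step 3: only now start serving clients.
            Thread acceptor = new Thread(() -> {
                try (Socket client = server.accept()) {
                    client.getOutputStream().write('k');
                } catch (IOException e) {
                    // Ignored in this sketch; the caller would observe a failed read instead.
                }
            });
            acceptor.start();

            char reply;
            try (Socket client = new Socket("127.0.0.1", port)) {
                reply = (char) client.getInputStream().read();
            }
            acceptor.join();
            return reply;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("served byte: " + run() + " on registered port " + registeredPort);
    }
}
```

With this ordering there is no window in which a request is handled before registration, which is
what the retry exception currently papers over.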

> NM restarts could lead to app failures
> --------------------------------------
>
>                 Key: YARN-3811
>                 URL: https://issues.apache.org/jira/browse/YARN-3811
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted
> 3. A tries to launch container on node N.
> 3 could lead to an NMNotYetReadyException depending on whether NM N has registered with
> the RM. In MR, this is considered a task attempt failure. A few of these could lead to a task/job
> failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
