hadoop-yarn-issues mailing list archives

From "sandflee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering
Date Thu, 12 Nov 2015 05:45:11 GMT

    [ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001725#comment-15001725 ]

sandflee commented on YARN-4051:

Thanks, [~jlowe].

Should the value be infinite by default? The concern is that if one container has issues recovering
(due to log aggregation woes or whatever) then we risk expiring all of the containers on this
node if we don't re-register with the RM within the node expiry interval. I think it makes
sense if we have also fixed the recovery paths so there aren't potentially long-running procedures
(like contacting HDFS) during the recovery process. If we haven't then we could create as
many problems as we're solving by waiting forever.
-- Agreed, I share this concern.

Why does the patch change the check interval? If it's to reduce the logging then we can better
fix that by only logging when the status changes rather than every iteration.
-- Yes, it was to reduce logging, since recovery is almost always very fast. Changed it back.

Nit: A value of zero should also be treated as a disabled max time.
-- So zero means registering with the RM at once, whether or not the NM has completed recovery, yes?

Nit: "Max time to wait NM to complete container recover before register to RM " should be
"Max time NM will wait to complete container recovery before registering with the RM".
-- Corrected.
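
The waiting behaviour discussed in the review points above could be sketched roughly as follows. This is a hypothetical illustration in Java, not code from the YARN-4051 patches: the class RecoveryWait, the method waitForRecovery, and its parameters are invented names. It assumes that a non-positive maximum wait disables the limit, which is only one reading of the nit about zero. It logs only when the number of recovering containers changes, rather than on every check.

```java
import java.util.function.Consumer;
import java.util.function.IntSupplier;

/** Hypothetical helper for illustration; not part of the NodeManager. */
final class RecoveryWait {

  /**
   * Polls until no containers are still recovering, or until maxWaitMs
   * has elapsed. Assumption: maxWaitMs <= 0 disables the limit, so the
   * caller waits until recovery completes.
   *
   * @return true if recovery completed, false if the wait timed out
   */
  static boolean waitForRecovery(IntSupplier pendingContainers,
      long maxWaitMs, long checkIntervalMs, Consumer<String> log)
      throws InterruptedException {
    final long start = System.nanoTime();
    int lastPending = -1; // forces a log line on the first check
    while (true) {
      int pending = pendingContainers.getAsInt();
      if (pending == 0) {
        log.accept("Container recovery complete");
        return true;
      }
      // Log only when the status changes, not on every iteration.
      if (pending != lastPending) {
        log.accept("Waiting for " + pending + " container(s) to recover");
        lastPending = pending;
      }
      long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
      if (maxWaitMs > 0 && elapsedMs >= maxWaitMs) {
        log.accept("Stopped waiting after " + elapsedMs + " ms; "
            + pending + " container(s) still recovering");
        return false;
      }
      Thread.sleep(checkIntervalMs);
    }
  }
}
```

With this shape, a caller would register with the RM once the method returns, whether recovery completed or the wait timed out, and the check interval can stay short without flooding the log.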

> ContainerKillEvent is lost when container is In New State and is recovering
> ---------------------------------------------------------------------------
>                 Key: YARN-4051
>                 URL: https://issues.apache.org/jira/browse/YARN-4051
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: sandflee
>            Assignee: sandflee
>            Priority: Critical
>         Attachments: YARN-4051.01.patch, YARN-4051.02.patch, YARN-4051.03.patch, YARN-4051.04.patch
> As in YARN-4050, the NM event dispatcher is blocked and the container is in the New state. When
> we finish the application, the container is still alive even after the NM event dispatcher is unblocked.

This message was sent by Atlassian JIRA
