hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4331) Restarting NodeManager leaves orphaned containers
Date Thu, 05 Nov 2015 14:28:28 GMT

    [ https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991729#comment-14991729 ]

Jason Lowe commented on YARN-4331:

SAMZA-750 is discussing RM restart, but this is NM restart.  They are related but mostly independent
features, and one can be enabled without the other.  Check whether yarn.nodemanager.recovery.enabled=true
is set on that node.  If you want to support rolling upgrades of the entire YARN cluster they both
need to be enabled, but if you simply want to restart/upgrade a NodeManager independently of
the ResourceManager then you can turn on nodemanager restart without resourcemanager restart.
NodeManager restart should be mostly invisible to applications except for interruptions in
the auxiliary services on that node (e.g.: the shuffle handler).
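
For reference, a minimal yarn-site.xml sketch for enabling work-preserving NM restart.  The
property names are the real ones; the recovery directory and the port are example values you
would pick for your cluster (NM restart needs yarn.nodemanager.address pinned to a fixed port
rather than an ephemeral one):

{code:xml}
<!-- Enable work-preserving NodeManager restart -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>

<!-- Local path where the NM persists container state across restarts (example value) -->
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>

<!-- Restart requires a fixed NM port; 45454 is just a common choice -->
<property>
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>
{code}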

bq. if the application master (AM) is dead, shouldn't it be the responsibility of the container
to kill itself?

That is completely application framework dependent and not the responsibility of YARN.  A
container is completely under the control of the application (i.e.: user code) and doesn't
have to have any YARN code in it at all.  Theoretically one could write an application entirely
in C or Go or whatever that generates compatible protocol buffers and adheres to the YARN
RPC protocol semantics.  No YARN code would be running at all for that application or in any
of its containers at that point.  (I know of no such applications, but it is theoretically possible.)

Also it is not a requirement that containers have an umbilical connection to the ApplicationMaster.
 That choice is up to the application, and some applications don't do this (like the distributed
shell sample YARN application).  MapReduce is an application framework that does have an umbilical
connection, but if there's a bug in that app where tasks don't properly recognize the umbilical
was severed then that's a bug in the app and not a bug in YARN.  Once the nodemanager died
on the node, YARN lost all ability to control containers on that node.  If the container decides
not to exit then that's an issue with the app more than an issue with YARN.  There's not much
YARN can do about it since YARN's actor on that node is no longer present.
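
To illustrate what a container-side umbilical check can look like, here is a minimal,
hypothetical Java sketch.  This is not MapReduce's actual umbilical; the AM host/port, the raw
socket probe, and the timing constants are all assumptions.  The idea is just that the container
process exits on its own once it can no longer reach its AM:

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/**
 * Hypothetical container-side liveness watchdog: probe the AM's port and
 * exit if it stays unreachable.  Real frameworks (e.g. MapReduce) use their
 * own RPC heartbeat rather than a raw socket probe like this.
 */
public class AmUmbilicalWatchdog implements Runnable {
  private final String amHost;        // assumption: handed to the container via env/args
  private final int amPort;           // assumption: the AM's umbilical port
  private static final int MAX_FAILURES = 3;  // exit after this many missed probes

  public AmUmbilicalWatchdog(String amHost, int amPort) {
    this.amHost = amHost;
    this.amPort = amPort;
  }

  @Override
  public void run() {
    int failures = 0;
    while (true) {
      try (Socket s = new Socket()) {
        s.connect(new InetSocketAddress(amHost, amPort), 5000);
        failures = 0;                 // AM reachable, reset the counter
      } catch (IOException e) {
        if (++failures >= MAX_FAILURES) {
          // Umbilical severed: don't linger as an orphan, shut down.
          System.exit(1);
        }
      }
      try {
        Thread.sleep(10_000L);        // probe every 10 seconds
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }
}
{code}

Running something like this in a daemon thread alongside the task would make the container
exit shortly after the AM disappears, which is exactly the behavior the question above assumes
YARN provides but which YARN actually leaves to the application.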

If NM restart is not enabled then the nodemanager should _not_ be killed with SIGKILL.  Simply
kill it with SIGTERM and the nodemanager should attempt to kill all containers before shutting
down.  Killing the NM with SIGKILL is normally only done when performing a work-preserving
restart on the NM, and that requires that yarn.nodemanager.recovery.enabled=true on that node
to function properly.
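
To make the two shutdown paths concrete (the pid-file location below assumes the stock layout
where pid files land in HADOOP_PID_DIR, /tmp by default, for a daemon running as the yarn user;
adjust for your install):

{noformat}
# Graceful stop: SIGTERM lets the NM kill its containers before it exits.
kill -TERM "$(cat /tmp/yarn-yarn-nodemanager.pid)"

# Work-preserving restart: SIGKILL leaves containers running; only safe when
# yarn.nodemanager.recovery.enabled=true so the restarted NM can reclaim them.
kill -9 "$(cat /tmp/yarn-yarn-nodemanager.pid)"
{noformat}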

> Restarting NodeManager leaves orphaned containers
> -------------------------------------------------
>                 Key: YARN-4331
>                 URL: https://issues.apache.org/jira/browse/YARN-4331
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.7.1
>            Reporter: Joseph
>            Priority: Critical
> We are seeing a lot of orphaned containers running in our production clusters.
> I tried to simulate this locally on my machine and can replicate the issue by killing the nodemanager.
> I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza jobs.
> Steps:
> {quote}1. Deploy a job 
> 2. Issue a kill -9 signal to nodemanager 
> 3. We should see the AM and its container running without nodemanager
> 4. AM should die but the container still keeps running
> 5. Restarting nodemanager brings up new AM and container but leaves the orphaned container
> running in the background
> {quote}
> This is effectively causing double processing of data.
