hadoop-yarn-issues mailing list archives

From "Joseph (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4331) Restarting NodeManager leaves orphaned containers
Date Thu, 05 Nov 2015 11:57:27 GMT

    [ https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991575#comment-14991575 ]

Joseph commented on YARN-4331:
------------------------------

[~jlowe] Thanks for your comments, they were very helpful.
yarn.resourcemanager.work-preserving-recovery.enabled is indeed set to false. We set it to
false because we run Samza jobs on the YARN cluster, and they don't work well with this
feature turned on (https://issues.apache.org/jira/browse/SAMZA-750).

Apologies for my ignorance in this area, but if the application master (AM) is dead, shouldn't
it be the responsibility of the container to kill itself? I'd imagine every container should be
required to heartbeat to its application master and kill itself if it misses a few heartbeats.
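
To make the idea concrete, here is a minimal sketch in Python of the kind of self-terminating
container described above. It is only an illustration under stated assumptions, not how YARN or
Samza actually behave: HeartbeatWatchdog, ping_am, max_missed and on_orphaned are all
hypothetical names, and the miss threshold is an arbitrary choice.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Counts consecutive missed heartbeats and decides when to give up."""

    def __init__(self, ping_am: Callable[[], bool], max_missed: int = 3) -> None:
        self.ping_am = ping_am        # hypothetical: returns True if the AM answered
        self.max_missed = max_missed  # hypothetical threshold, not a YARN setting
        self.missed = 0

    def beat(self) -> bool:
        """Send one heartbeat; return True if the container should keep running."""
        try:
            ok = self.ping_am()
        except OSError:
            ok = False  # a connection failure counts as a missed heartbeat
        # Any successful heartbeat resets the counter.
        self.missed = 0 if ok else self.missed + 1
        return self.missed < self.max_missed

    def run(self, interval_s: float, on_orphaned: Callable[[], None],
            sleep: Callable[[float], None] = time.sleep) -> None:
        """Heartbeat until too many are missed in a row, then call the shutdown hook."""
        while self.beat():
            sleep(interval_s)
        on_orphaned()
```

In a real container, on_orphaned would presumably stop processing and exit the process, so
that a container whose AM has gone away does not keep consuming input.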


> Restarting NodeManager leaves orphaned containers
> -------------------------------------------------
>
>                 Key: YARN-4331
>                 URL: https://issues.apache.org/jira/browse/YARN-4331
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.7.1
>            Reporter: Joseph
>            Priority: Critical
>
> We are seeing a lot of orphaned containers running in our production clusters.
> I tried to simulate this locally on my machine and can reproduce the issue by killing the
> NodeManager.
> I'm running YARN 2.7.1 with RM state stored in ZooKeeper and deploying Samza jobs.
> Steps:
> {quote}1. Deploy a job
> 2. Issue a kill -9 signal to the NodeManager
> 3. We should see the AM and its container running without the NodeManager
> 4. The AM should die, but the container keeps running
> 5. Restarting the NodeManager brings up a new AM and container but leaves the orphaned container
> running in the background
> {quote}
> This is effectively causing double processing of data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
