hadoop-yarn-issues mailing list archives

From "Craig Welch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
Date Thu, 02 Oct 2014 22:09:36 GMT

    [ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157286#comment-14157286 ]

Craig Welch commented on YARN-1680:

This does bring up what I think could be an issue. I'm not sure if it was what you were getting
at before or not, [~john.jian.fang], but we could well be introducing a new bug here unless
we are careful. I don't see any connection between the scheduler-level resource adjustments
and the application-level adjustments, so if an application had problems with a node and blacklisted
it, and then the cluster did as well, the resource value of the node would effectively be removed
from the headroom twice (once when the application adds it to its new "blacklist reduction",
and a second time when the cluster removes its value from the clusterResource). I think
this could be a problem, and I think it could be addressed, but it's something to think about,
and I don't think the current approach handles it. [~airbots], [~jlowe], thoughts?
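
To make the double-counting concrete, here is a minimal sketch using YARN's Resource/Resources utilities. The numbers and the names blacklistReduction and shrunkCluster are illustrative assumptions, not identifiers from the patch under discussion:

{code:java}
// Minimal sketch of the double-subtraction risk (hypothetical numbers/names).
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class HeadroomDoubleCountSketch {
  public static void main(String[] args) {
    Resource clusterResource = Resource.newInstance(32 * 1024, 32); // 4 NMs x 8GB
    Resource used            = Resource.newInstance(20 * 1024, 20);
    Resource badNode         = Resource.newInstance(8 * 1024, 8);

    // 1) The application blacklists the node and deducts its capacity from
    //    its own headroom (the "blacklist reduction" discussed above).
    Resource blacklistReduction = badNode;

    // 2) Later the cluster also removes the node, shrinking clusterResource.
    Resource shrunkCluster = Resources.subtract(clusterResource, badNode);

    // Without reconciling the two adjustments, the node comes off twice:
    Resource headroom = Resources.subtract(
        Resources.subtract(shrunkCluster, used), blacklistReduction);

    // Correct headroom: 32GB - 8GB - 20GB = 4GB; here it lands 8GB below that.
    System.out.println("headroom with double subtraction: " + headroom);
  }
}
{code}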

> availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
> ------------------------------------------------------------------------------------------------------
>                 Key: YARN-1680
>                 URL: https://issues.apache.org/jira/browse/YARN-1680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 2.2.0, 2.3.0
>         Environment: SuSE 11 SP2 + Hadoop-2.3 
>            Reporter: Rohith
>            Assignee: Chen He
>         Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch
> There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1.
> A job is running whose reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) becomes unstable (3 map tasks got killed), so the MRAppMaster blacklists NM-4. All reducer tasks are now running in the cluster.
> The MRAppMaster does not preempt the reducers because the headroom used in the reducer-preemption calculation still counts the blacklisted node's memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, but the availableResources it returns still reflects the whole cluster's free memory).
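
For reference, a hedged sketch of the adjustment the report asks for: subtract the free capacity on the application's blacklisted nodes from the headroom the ResourceManager reports, clamping at zero. The helper name adjustedHeadroom and the assumption that all of the free space sits on the blacklisted NM-4 are illustrative, not taken from any attached patch:

{code:java}
// Hedged sketch: exclude blacklisted nodes' free capacity from headroom.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class BlacklistAwareHeadroomSketch {

  // Subtract the free space on the application's blacklisted nodes from the
  // headroom the RM reported, clamping componentwise at zero so a large
  // blacklist cannot yield a negative headroom.
  static Resource adjustedHeadroom(Resource reportedHeadroom,
                                   Resource freeOnBlacklistedNodes) {
    return Resources.componentwiseMax(
        Resources.subtract(reportedHeadroom, freeOnBlacklistedNodes),
        Resources.none());
  }

  public static void main(String[] args) {
    // From the report: 32GB cluster, 29GB used -> RM reports ~3GB headroom.
    // If that free space sits on the blacklisted NM-4 (an assumption here),
    // the AM's real headroom is zero, so it should preempt a reducer.
    Resource reported          = Resource.newInstance(3 * 1024, 3);
    Resource freeOnBlacklisted = Resource.newInstance(3 * 1024, 3);
    System.out.println(adjustedHeadroom(reported, freeOnBlacklisted));
  }
}
{code}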
