hadoop-yarn-issues mailing list archives

From "Craig Welch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
Date Thu, 02 Oct 2014 22:03:34 GMT

    [ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157271#comment-14157271 ]

Craig Welch commented on YARN-1680:

[~john.jian.fang] I should probably not have referred to the cluster-level adjustments as
"blacklisting".  What I see is a mechanism (state machine, events, including adding and removing
nodes and the "unhealthy" state/the health monitor) that, I think, ultimately results in the
CapacityScheduler.addNode() and removeNode() calls, which modify the clusterResource value.
In any case, the blacklisting functionality we are addressing here definitely looks to be
application specific and needs to be addressed at that level.  The issue isn't, so far as I know,
related to any blacklisting/node health issues outside the one in play here, as those should
work properly for headroom because they adjust the cluster resource.  The problem is that the application
blacklist activity does not adjust the cluster resource and was previously not involved in
the headroom calculation.  If it's not the case that cluster-level adjustments are being made
for nodes, then this blacklisting will result in duplication among applications as they independently
discover problems with nodes and blacklist them, but that is not a new characteristic of the
way the system works.

> availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
> ------------------------------------------------------------------------------------------------------
>                 Key: YARN-1680
>                 URL: https://issues.apache.org/jira/browse/YARN-1680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 2.2.0, 2.3.0
>         Environment: SuSE 11 SP2 + Hadoop-2.3 
>            Reporter: Rohith
>            Assignee: Chen He
>         Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch
> There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start
> is set to 1.
> A job is running, and its reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) has become
> unstable (3 maps got killed), so the MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer
> tasks are running in the cluster now.
> The MRAppMaster does not preempt the reducers because, for the reducer preemption calculation,
> headRoom includes the blacklisted node's memory. This makes jobs hang forever (the ResourceManager
> does not assign any new containers on blacklisted nodes, but the availableResources it returns still counts
> the whole cluster's free memory).
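
The arithmetic behind the report can be sketched roughly as follows. This is only an illustration, not YARN code: the class and method names are hypothetical, and the per-node split of free memory is an assumption (the report gives only the 32GB total and the 29GB in use, so the 3GB that remains is placed entirely on NM-4 here to show the worst case).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative only: these names are hypothetical and are not YARN classes.
final class HeadroomSketch {

    /** Free memory summed over every node, blacklisted or not. */
    static int clusterFreeGb(Map<String, Integer> freeGbByNode) {
        int total = 0;
        for (int free : freeGbByNode.values()) {
            total += free;
        }
        return total;
    }

    /** Free memory summed only over nodes the application has not blacklisted. */
    static int usableFreeGb(Map<String, Integer> freeGbByNode, Set<String> blacklisted) {
        int total = 0;
        for (Map.Entry<String, Integer> e : freeGbByNode.entrySet()) {
            if (!blacklisted.contains(e.getKey())) {
                total += e.getValue();
            }
        }
        return total;
    }

    /** Assumed free memory per node for the 4 x 8GB cluster with 29GB in use. */
    static Map<String, Integer> exampleFreeGb() {
        Map<String, Integer> free = new LinkedHashMap<>();
        free.put("NM-1", 0);
        free.put("NM-2", 0);
        free.put("NM-3", 0);
        free.put("NM-4", 3); // assumption: all remaining free memory sits on the blacklisted node
        return free;
    }

    public static void main(String[] args) {
        Map<String, Integer> free = exampleFreeGb();
        Set<String> blacklist = Set.of("NM-4");
        System.out.println("headroom counting all nodes (GB): " + clusterFreeGb(free));
        System.out.println("headroom excluding blacklist (GB): " + usableFreeGb(free, blacklist));
    }
}
```

Under these assumptions the first figure is 3GB and the second is 0GB. An application master that sees 3GB of headroom has no reason to preempt reducers, even though none of that memory is on a node it is willing to use.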

This message was sent by Atlassian JIRA
