hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
Date Wed, 06 May 2015 00:42:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529654#comment-14529654 ]

Wangda Tan commented on YARN-1680:

I think we should stop adding such application-specific logic into the RM; applications can have
very varied resource requests, for example:
- Relaxed locality, which is very similar to a "white list".
- A black list.
- In the future, affinity/anti-affinity/constraints.
We cannot do such expensive calculations in a centralized way.

The RM should only take care of general limits, such as user limits and queue limits, as it
does now.

I propose to:
- In the short term, treat the headroom as just a hint, like what [~kasha] mentioned: https://issues.apache.org/jira/browse/MAPREDUCE-6302?focusedCommentId=14494728&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14494728.
I'm not sure whether MAPREDUCE-6302 already solves the problem; I haven't looked at the patch yet.
- In the longer term, support headroom calculation in client-side utilities; AMRMClient may be
a good place.
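A client-side helper along the lines of the longer-term proposal could look like the sketch below. This is purely illustrative: the class name, `nodeFreeMb`, and `blacklist` are hypothetical names, not part of the actual AMRMClient API; the idea is only to subtract free memory on blacklisted nodes from the RM-reported headroom on the AM side.

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch (hypothetical names, not the real AMRMClient API):
// the RM will never place this AM's containers on nodes the AM has
// blacklisted, so free memory there should not count toward headroom.
public class ClientSideHeadroom {
    /**
     * @param reportedHeadroomMb headroom the RM sent in the heartbeat, in MB
     * @param nodeFreeMb         free memory per node, in MB
     * @param blacklist          node IDs this AM has blacklisted
     * @return headroom actually usable by this AM, never negative
     */
    public static long adjustedHeadroomMb(long reportedHeadroomMb,
                                          Map<String, Long> nodeFreeMb,
                                          Set<String> blacklist) {
        long freeOnBlacklisted = 0;
        for (String node : blacklist) {
            freeOnBlacklisted += nodeFreeMb.getOrDefault(node, 0L);
        }
        return Math.max(0, reportedHeadroomMb - freeOnBlacklisted);
    }
}
```

With a per-node free-memory view (which the AM would need to obtain or approximate), an AM could call this after each heartbeat instead of trusting the raw headroom for preemption decisions.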

> availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
> ------------------------------------------------------------------------------------------------------
>                 Key: YARN-1680
>                 URL: https://issues.apache.org/jira/browse/YARN-1680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>    Affects Versions: 2.2.0, 2.3.0
>         Environment: SuSE 11 SP2 + Hadoop-2.3 
>            Reporter: Rohith
>            Assignee: Craig Welch
>         Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch
> There are 4 NodeManagers with 8 GB each, so the total cluster capacity is 32 GB. Cluster slow
start is set to 1.
> A job is running; its reducer tasks occupy 29 GB of the cluster. One NodeManager (NM-4) became
unstable (3 map tasks got killed), so the MRAppMaster blacklisted NM-4. All reducer
tasks are now running in the cluster.
> The MRAppMaster does not preempt the reducers, because the reducer-preemption calculation uses
a headroom that still counts the blacklisted node's memory. This makes jobs hang forever: the
ResourceManager does not assign any new containers on blacklisted nodes, but the availableResources
it returns still considers the whole cluster's free memory.
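The arithmetic behind the hang can be sketched as follows (the numbers come from the report above; the assumption that all remaining free memory sits on the blacklisted node is illustrative):

```java
// Worked numbers from the reported scenario (illustrative sketch).
public class ScenarioMath {
    public static void main(String[] args) {
        long nodeCapacityMb = 8 * 1024;                       // 8 GB per NodeManager
        long totalMb = 4 * nodeCapacityMb;                    // 4 NMs -> 32 GB cluster
        long usedByReducersMb = 29 * 1024;                    // reducers occupy 29 GB
        long reportedHeadroomMb = totalMb - usedByReducersMb; // RM reports 3 GB
        System.out.println(reportedHeadroomMb);               // prints 3072
        // If that remaining 3 GB of free memory lives on blacklisted NM-4,
        // the AM can actually launch nothing, yet it sees a positive headroom
        // and therefore never preempts a reducer -> the job hangs.
    }
}
```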

This message was sent by Atlassian JIRA
