hadoop-yarn-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
Date Wed, 06 May 2015 00:08:03 GMT

    [ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529600#comment-14529600 ]

Vinod Kumar Vavilapalli commented on YARN-1680:

Please leave out the head-room concerns w.r.t. node-labels. IIRC, we have tickets at YARN-796
tracking that. It will very likely need a completely different solution anyway.

There is no notion of cluster-level blacklisting in YARN. We have notions of unhealthy/lost/decommissioned
nodes in a cluster. When such events happen, applications are already informed of them
via the heartbeats, and the head-room automatically changes when nodes are removed that way.

Coming to app-level blacklisting: clearly, the proposed solution is better than deadlocks.
But blindly reducing the resources corresponding to blacklisted nodes will result in under-utilization
(sometimes massive) and over-conservative scheduling requests by apps. One way to resolve
this is to have the apps themselves (or, optionally, the AMRMClient library) deduct the resources
unusable on blacklisted nodes.
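A minimal sketch of that app-side deduction, assuming the AM tracks the memory of the nodes it has blacklisted (the class and method names here are illustrative, not part of the YARN or AMRMClient API):

```java
import java.util.Map;

// Hypothetical helper: correct the RM-reported headroom by deducting the
// free memory sitting on nodes this application has blacklisted, since the
// RM will not place this app's containers there anyway.
public class HeadroomAdjuster {

    /**
     * Returns the reported headroom minus the total free memory on
     * blacklisted nodes, floored at zero.
     */
    public static long adjustedHeadroomMb(long reportedHeadroomMb,
                                          Map<String, Long> blacklistedNodeFreeMb) {
        long unusable = 0;
        for (long freeMb : blacklistedNodeFreeMb.values()) {
            unusable += freeMb;
        }
        return Math.max(0, reportedHeadroomMb - unusable);
    }
}
```

With this correction the AM's preemption logic sees zero headroom when all remaining free memory is on blacklisted nodes, instead of concluding that new containers could still be allocated.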

> availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
> ------------------------------------------------------------------------------------------------------
>                 Key: YARN-1680
>                 URL: https://issues.apache.org/jira/browse/YARN-1680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>    Affects Versions: 2.2.0, 2.3.0
>         Environment: SuSE 11 SP2 + Hadoop-2.3 
>            Reporter: Rohith
>            Assignee: Craig Welch
>         Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch
> There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start
is set to 1.
> A job is running; its reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) becomes
unstable (3 map tasks got killed), so MRAppMaster blacklists the unstable NodeManager (NM-4). All reducer
tasks are now running in the cluster.
> MRAppMaster does not preempt the reducers because the headroom used in the reducer-preemption
calculation still includes the blacklisted node's memory. This makes the job hang forever (the ResourceManager
does not assign any new containers on blacklisted nodes, but the availableResources it returns still
reflects the whole cluster's free memory).
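As a sanity check on the numbers in the report: with 29GB of the 32GB cluster in use, the RM reports 3GB of headroom, but all of it sits on blacklisted NM-4, so the headroom actually usable by the AM is zero and reducer preemption never triggers. A short illustrative calculation (the method name is hypothetical, not a YARN API):

```java
// Illustrative arithmetic for the reported scenario: 4 NodeManagers x 8GB = 32GB.
public class HeadroomExample {

    /**
     * Headroom actually usable by the AM once free memory on blacklisted
     * nodes is deducted from what the RM reports.
     */
    public static long effectiveHeadroomMb(long clusterMb, long usedMb,
                                           long blacklistedFreeMb) {
        long reportedHeadroomMb = clusterMb - usedMb; // what the RM reports: 3GB here
        return Math.max(0, reportedHeadroomMb - blacklistedFreeMb); // 0 here
    }
}
```

Because the AM sees 3GB of headroom it never preempts a reducer, yet no map container can actually be allocated, which is exactly the hang described above.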

This message was sent by Atlassian JIRA
