hadoop-yarn-issues mailing list archives

From "Chen He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
Date Mon, 02 Feb 2015 19:07:37 GMT

    [ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301703#comment-14301703 ]

Chen He commented on YARN-1680:
-------------------------------

Hi [~cwelch], thank you for the comments. 

{quote}Unfortunately, I don't think that this can be solved with checks during addition and
removal - I believe that we will need to keep a persistent picture of all blacklisted nodes
for an application regardless of their cluster state because the two can vary independently
and changes after a blacklist request may invalidate things{quote}

Yes, I agree. However, as [~jlowe] suggested, simply fixing the headroom calculation, without
introducing new mechanisms or facts, can help clusters that are not using label scheduling.
Can we leave the new features and facts to YARN-2848 and just fix the headroom
calculation here?

{quote} (for example, cluster blacklists just before app blacklists, the app blacklist request
is discarded, the cluster reinstates but the app still cannot use the node for reasons different
from the nodes cluster availability - we will still include that node in headroom incorrectly...).{quote}

Do you mean this scenario?
The cluster has just added node A to its blacklist. A millisecond later, the app requests to
blacklist A as well, but the request is suddenly discarded. (But why would it be discarded
under my proposed idea?)

I wrote down all the possible conditions here:
1. The cluster adds A to its blacklist and the app never does. We just remove A's resource from
clusterResource; this is covered in my previous patch.
2. The cluster adds A to its blacklist and the app does too. With the previous solution, A's
resource can be subtracted from clusterResource twice, which is incorrect; we need to subtract it only once.
3. The cluster does not add A but the app does. This is the normal case and not a problem.
4. The cluster adds A to its blacklist and so does the app, but a millisecond later the cluster
removes A (the node suddenly becomes healthy?). We still need to subtract A's resource only once.
5. The cluster does not add A and the app does, but the cluster adds A during the headroom
calculation. We may get an incorrect headroom for that heartbeat, but we will get a correct one
in the next heartbeat.
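
To make the "subtract only once" rule in conditions 1 to 4 concrete, here is a minimal sketch. It
is not taken from any YARN-1680 patch: the class HeadroomSketch, the method headroomMb, and the
per-node free-memory map are all hypothetical names used only for illustration. It also assumes
the free-memory map still lists nodes the cluster has blacklisted. The numbers in main follow the
issue description (4 NodeManagers, 29GB of 32GB used), with the extra assumption that the 3GB
still free sits entirely on NM-4.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names exist in the YARN code base.
final class HeadroomSketch {

    /**
     * Returns the free memory (MB) the app can really use: total free memory
     * minus the free memory of every node on either blacklist.
     * Taking the union of both blacklists guarantees that a node listed by the
     * cluster AND by the app is subtracted exactly once.
     */
    static long headroomMb(Map<String, Long> freeMbByNode,
                           Set<String> clusterBlacklist,
                           Set<String> appBlacklist) {
        long headroom = 0;
        for (long freeMb : freeMbByNode.values()) {
            headroom += freeMb;
        }

        Set<String> excluded = new HashSet<>(clusterBlacklist);
        excluded.addAll(appBlacklist); // union: duplicates collapse

        for (String node : excluded) {
            // Unknown nodes contribute nothing.
            headroom -= freeMbByNode.getOrDefault(node, 0L);
        }
        return headroom;
    }

    public static void main(String[] args) {
        // Assumption: the 3GB still free in the 32GB cluster is all on NM-4.
        Map<String, Long> free = new HashMap<>();
        free.put("NM-1", 0L);
        free.put("NM-2", 0L);
        free.put("NM-3", 0L);
        free.put("NM-4", 3072L);

        Set<String> none = new HashSet<>();
        Set<String> nm4 = new HashSet<>();
        nm4.add("NM-4");

        System.out.println(headroomMb(free, none, none)); // nobody blacklists: 3072
        System.out.println(headroomMb(free, nm4, none));  // condition 1: 0
        System.out.println(headroomMb(free, nm4, nm4));   // conditions 2 and 4: 0, not -3072
        System.out.println(headroomMb(free, none, nm4));  // condition 3: 0
    }
}
```

Condition 5 is a race between the blacklist update and the calculation, so a sketch like this
cannot rule it out; it only shows that recomputing from the current sets on the next heartbeat
gives the correct value again.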

Please let me know if there is any scenario that I did not cover. 

> availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1680
>                 URL: https://issues.apache.org/jira/browse/YARN-1680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 2.2.0, 2.3.0
>         Environment: SuSE 11 SP2 + Hadoop-2.3 
>            Reporter: Rohith
>            Assignee: Chen He
>         Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch
>
>
> There are 4 NodeManagers with 8GB each, so total cluster capacity is 32GB. Cluster slow start is set to 1.
> A job is running whose reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 maps got killed), so the MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running in the cluster.
> The MRAppMaster does not preempt the reducers because the headroom used in the reducer preemption calculation includes the blacklisted node's memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, but the availableResources it returns still counts the cluster's free memory).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
