hadoop-yarn-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
Date Tue, 07 Oct 2014 01:01:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161300#comment-14161300 ]

Hadoop QA commented on YARN-1857:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12673238/YARN-1857.6.patch
  against trunk revision 519e5a7.

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new or modified
test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of
javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version
2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number
of release audit warnings.

    {color:red}-1 core tests{color}.  The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

                  org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5292//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5292//console

This message is automatically generated.

> CapacityScheduler headroom doesn't account for other AM's running
> -----------------------------------------------------------------
>
>                 Key: YARN-1857
>                 URL: https://issues.apache.org/jira/browse/YARN-1857
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>    Affects Versions: 2.3.0
>            Reporter: Thomas Graves
>            Assignee: Chen He
>            Priority: Critical
>         Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, YARN-1857.4.patch,
YARN-1857.5.patch, YARN-1857.6.patch, YARN-1857.patch, YARN-1857.patch, YARN-1857.patch
>
>
> It's possible for an application to hang forever (or for a long time) in a cluster with
multiple users.  The reason is that the headroom sent to the application is based on the
user limit, but it doesn't account for other application masters using space in that queue.
So the headroom (user limit - user consumed) can be > 0 even though the cluster is 100%
full, because the remaining space is being used by application masters from other users.
> For instance, suppose you have a cluster with 1 queue, the user limit is 100%, and multiple
users are submitting applications.  One very large application by user 1 starts up, runs most
of its maps, and starts running reducers.  Other users try to start applications and get their
application masters started, but not their tasks.  The very large application then gets to the
point where it has consumed the rest of the cluster resources with reduces, but it still needs
to finish a few maps.  The headroom being sent to this application is based only on the user
limit (which is 100% of the cluster capacity).  It is using, let's say, 95% of the cluster for
reduces, and the other 5% is being used by other users running application masters.  The
MRAppMaster thinks it still has 5%, so it doesn't know that it should kill a reduce in order
to run a map.
> This can happen in other scenarios as well.  Generally, in a large cluster with multiple
queues, this shouldn't cause a permanent hang, but it could cause the application to take much
longer.
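
The arithmetic in the description above can be illustrated with a minimal Java sketch. It is only an illustration of the reported problem, not the contents of any attached patch and not CapacityScheduler code. The class name HeadroomSketch, both method names, and the use of whole-number percentages of cluster capacity are assumptions made for this example. The first method mirrors the "user limit - user consumed" formula from the description; the second shows one possible way to also cap the result by the space actually left in the queue.

```java
public class HeadroomSketch {

    // Headroom as described in the report: user limit minus what the user
    // has consumed. All values are hypothetical percentages of the cluster.
    public static int naiveHeadroom(int userLimit, int userConsumed) {
        return Math.max(0, userLimit - userConsumed);
    }

    // One possible adjustment (an assumption, not the actual fix): also cap
    // the headroom by the space still free in the queue, so that resources
    // held by other users' application masters are accounted for.
    public static int queueAwareHeadroom(int userLimit, int userConsumed,
                                         int queueCapacity, int queueUsed) {
        int userRoom = userLimit - userConsumed;
        int queueRoom = queueCapacity - queueUsed;
        return Math.max(0, Math.min(userRoom, queueRoom));
    }

    public static void main(String[] args) {
        int userLimit = 100;      // user limit is 100% of the cluster
        int userConsumed = 95;    // large application's reduces
        int otherAms = 5;         // other users' application masters
        int queueCapacity = 100;  // single queue spanning the cluster
        int queueUsed = userConsumed + otherAms;

        System.out.println("naive headroom: "
                + naiveHeadroom(userLimit, userConsumed));
        System.out.println("queue-aware headroom: "
                + queueAwareHeadroom(userLimit, userConsumed,
                                     queueCapacity, queueUsed));
    }
}
```

With the numbers from the scenario, the naive calculation reports 5 while the queue-aware one reports 0. A headroom of 0 is the signal the MRAppMaster would need in order to decide to kill a reduce so that a map can run.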



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
