hadoop-yarn-issues mailing list archives

From "Thomas Graves (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
Date Wed, 19 Mar 2014 20:35:42 GMT
Thomas Graves created YARN-1857:

             Summary: CapacityScheduler headroom doesn't account for other AM's running
                 Key: YARN-1857
                 URL: https://issues.apache.org/jira/browse/YARN-1857
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacityscheduler
    Affects Versions: 2.3.0
            Reporter: Thomas Graves

It's possible for an application to hang forever (or for a long time) in a cluster with multiple
users.  The reason is that the headroom sent to the application is based on the user limit,
but it doesn't account for other users' application masters using space in that queue.  So the
headroom (user limit (100%) - user consumed) can be > 0 even though the cluster is 100% full,
because the remaining space is being used by application masters from other users.
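
The headroom arithmetic described above can be sketched as follows.  This is only an
illustration, not the CapacityScheduler's actual code: the function names and the idea of
capping by free queue space are assumptions made for the example, and resources are
expressed as plain percentages of the queue.

```python
def user_limit_headroom(user_limit, user_consumed):
    """Headroom as described in the report: user limit minus what the user consumed."""
    return max(user_limit - user_consumed, 0)


def capped_headroom(user_limit, user_consumed, queue_capacity, queue_used):
    """Hypothetical alternative: never report more than what is actually free in the queue.

    queue_used includes everything running in the queue, including
    application masters that belong to other users.
    """
    by_user = max(user_limit - user_consumed, 0)
    free_in_queue = max(queue_capacity - queue_used, 0)
    return min(by_user, free_in_queue)


# User 1 holds 95% of the queue; other users' application masters hold 5%.
print(user_limit_headroom(100, 95))        # user-limit view
print(capped_headroom(100, 95, 100, 100))  # view that accounts for a full queue
```

With these numbers the user-limit view still reports spare capacity while the capped view
reports none, which is the mismatch this issue is about.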

For instance, suppose you have a cluster with 1 queue, the user limit is 100%, and multiple
users are submitting applications.  One very large application by user 1 starts up, runs most
of its maps and starts running reducers.  Other users try to start applications and get their
application masters started, but not their tasks.  The very large application then gets to the
point where it has consumed the rest of the cluster resources with reducers, but it still needs
to finish a few maps.  The headroom being sent to this application is based only on the user
limit (which is 100% of the cluster capacity).  It is using, let's say, 95% of the cluster for
reducers, and the other 5% is being used by other users' application masters.  The MRAppMaster
thinks it still has 5% available, so it doesn't know that it should kill a reducer in order
to run a map.
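
A toy walk-through of this scenario, using the 95% / 5% figures from the example.  The
decision rule below is a deliberately simplified stand-in for the MRAppMaster's logic, and
every name in it (should_kill_reducer, MAP_SIZE, and so on) is hypothetical.

```python
CLUSTER = 100        # total cluster capacity, in percent
USER_LIMIT = 100     # user limit, in percent of the cluster
MAP_SIZE = 1         # assumed size of one map task, in percent

reducers_user1 = 95  # held by the large application's reducers
other_ams = 5        # held by other users' application masters


def should_kill_reducer(headroom, pending_maps, map_size=MAP_SIZE):
    """Simplified rule: give up a reducer only if a pending map cannot fit in the headroom."""
    return pending_maps > 0 and headroom < map_size


# Headroom as reported today: based on the user limit only.
reported = USER_LIMIT - reducers_user1
# Space that is actually free once other users' application masters are counted.
actual_free = CLUSTER - (reducers_user1 + other_ams)

print(reported, should_kill_reducer(reported, pending_maps=3))
print(actual_free, should_kill_reducer(actual_free, pending_maps=3))
```

With the reported headroom the application never gives up a reducer, so the remaining maps
cannot be scheduled; with the actual free space it would.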

This can also happen in other scenarios.  In a large cluster with multiple queues this
generally shouldn't cause a permanent hang, but it could make the application take much longer.

This message was sent by Atlassian JIRA
