hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-201) CapacityScheduler can take a very long time to schedule containers if requests are off cluster
Date Wed, 07 Nov 2012 19:00:12 GMT

     [ https://issues.apache.org/jira/browse/YARN-201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-201:

    Attachment: YARN-201.patch

Thought a bit about filtering the AM's requests based on which resources are "active" (i.e.:
nodes/racks we know about that are capable of launching containers), but that's complicated
by the latch-like protocol used between the AM and the RM.  Definitely possible, just involved.
In the interest of getting something working in this area sooner rather than later, here's
a patch that emulates the 1.x behavior of not resetting the scheduling opportunities when
allocating off-switch containers.

Like 1.x, this still has an initial scheduling penalty for jobs that ask for containers on
many off-cluster resources, but the job doesn't keep paying the penalty after each off-switch
container is allocated.  Ideally we shouldn't be paying the penalty at all, hence the idea
of filtering locality requests based on what is capable of being local, but we can tackle
that riskier proposition in another JIRA.
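To make the penalty concrete, here's a toy simulation of the scheduling-opportunities delay described above. This is hypothetical illustration, not the actual LeafQueue code or the patch: the class name, the heartbeat model, and the OFF_SWITCH_DELAY value are all assumptions chosen for clarity. It shows why resetting the opportunity counter after every off-switch allocation (the pre-patch behavior) makes a job re-pay the delay for each container, while leaving the counter alone pays it only once.

```java
// Hypothetical sketch of the off-switch delay in delay scheduling.
// Not the real CapacityScheduler code; names and values are illustrative.
public class OffSwitchDelaySim {

    // Number of missed scheduling opportunities (node heartbeats) an app
    // must accumulate before an off-switch container may be assigned.
    // Illustrative value only.
    static final int OFF_SWITCH_DELAY = 23;

    /**
     * Counts the heartbeats needed to allocate the requested number of
     * off-switch containers.
     *
     * @param containers           containers the app is asking for
     * @param resetAfterAllocation pre-patch behavior: reset the counter
     *                             after every off-switch allocation, so
     *                             each container re-pays the full delay
     */
    static int heartbeatsNeeded(int containers, boolean resetAfterAllocation) {
        int opportunities = 0;  // missed scheduling opportunities so far
        int allocated = 0;
        int heartbeats = 0;
        while (allocated < containers) {
            heartbeats++;
            opportunities++;
            if (opportunities > OFF_SWITCH_DELAY) {
                allocated++;  // delay satisfied: assign one off-switch container
                if (resetAfterAllocation) {
                    opportunities = 0;  // pre-patch: the penalty repeats
                }
            }
        }
        return heartbeats;
    }

    public static void main(String[] args) {
        System.out.println("pre-patch  (reset): "
                + heartbeatsNeeded(10, true) + " heartbeats for 10 containers");
        System.out.println("post-patch (keep) : "
                + heartbeatsNeeded(10, false) + " heartbeats for 10 containers");
    }
}
```

With the reset, every container costs the full delay again; without it, only the first container pays, which matches the "initial scheduling penalty but no repeated penalty" behavior described above.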
> CapacityScheduler can take a very long time to schedule containers if requests are off cluster
> ----------------------------------------------------------------------------------------------
>                 Key: YARN-201
>                 URL: https://issues.apache.org/jira/browse/YARN-201
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: YARN-201.patch
> When a user runs a job where one of the input files is a large file on another cluster,
the job can create many splits on nodes which are unreachable for computation from the current
cluster.  The off-switch delay logic in LeafQueue can cause the ResourceManager to allocate
containers for the job very slowly.  In one case the job was only getting one container every
23 seconds even though the queue had plenty of spare capacity.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
