hadoop-yarn-issues mailing list archives

From "Subramaniam Krishnan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1712) Admission Control: plan follower
Date Thu, 04 Sep 2014 00:03:53 GMT

    [ https://issues.apache.org/jira/browse/YARN-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120729#comment-14120729 ]

Subramaniam Krishnan commented on YARN-1712:

Thanks [~leftnoteasy] for taking a look at the patch. Since [~curino] is traveling, I'll try
to answer your questions. Your understanding is very close: your steps 1-7 are correct. There
is some context missing that I will try to explain.

bq. Question: Why not do 2) after 4)? Is it better to do shrink after excluded expired reservations?

Shrinking might be required because reservations are absolute while queues express relative
(% of cluster) capacity. We need to shrink first, as shrinking might result in additional
expired reservations. The expired reservations are determined as those reservations that exist
in the scheduler but are not currently active in the Plan (post shrinking, if required). I
should add that shrinking is a rare exception case for when we lose large chunks of cluster
capacity.
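To make the ordering concrete, here is a minimal sketch (class and method names are hypothetical, not the actual YARN code) of how the expired set would be derived: a reservation is expired when it still exists in the scheduler but is no longer active in the Plan, so this set can only be computed after shrinking has been applied.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: expired reservations are those present in the
// scheduler but absent from the Plan's currently-active set. Shrinking can
// deactivate additional reservations, which is why it must run first.
class ExpirySketch {
    static Set<String> expiredReservations(Set<String> inScheduler,
                                           Set<String> activeInPlan) {
        Set<String> expired = new HashSet<>(inScheduler);
        expired.removeAll(activeInPlan); // in scheduler, not active in Plan
        return expired;
    }
}
```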

bq. 6) Sort all reservations, from less to more satisfied, and set their new entitlement.
bq. Question: Is it possible totalAssignedCapacity > 1? Could you please explain how to
avoid it happening?

We sort all reservations based on what was promised at this moment in time. That can vary
because we support skylines for reservations, i.e., resource requirements that vary over time.
This is required to handle DAGs, as in the case of Tez, Oozie, Hive, or Pig queries, since the
nodes of the DAG will have different resource needs. This is explained in detail in the tech
report we uploaded as part of YARN-1051.
The totalAssignedCapacity will never exceed 1 because:
  1) We always release all excess capacity before starting to allocate fresh capacity.
  2) The reservations themselves are validated before being added to the Plan to ensure that
they never exceed the total capacity of the Plan (YARN-1709 & YARN-1711). As mentioned
above, shrinking will handle large transient cluster failures.
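These two invariants can be sketched as follows (a toy illustration, not the actual CapacityScheduler API): excess capacity is released before anything fresh is allocated, and any demand that would push the total past 1 corresponds to a reservation the Plan would have rejected at admission time.

```java
// Hypothetical sketch of why totalAssignedCapacity stays <= 1.
class CapacitySketch {
    private double totalAssigned;

    CapacitySketch(double initiallyAssigned) {
        totalAssigned = initiallyAssigned;
    }

    // Step 1 of a follower pass: free capacity held by shrunk or expired
    // reservations before handing out anything new.
    void release(double excess) {
        totalAssigned -= excess;
    }

    // Step 2: grant fresh entitlements. A demand exceeding the remaining
    // capacity is refused here; in the real system it would have been
    // rejected at Plan-admission time (YARN-1709 / YARN-1711).
    boolean allocate(double demand) {
        if (totalAssigned + demand > 1.0) {
            return false;
        }
        totalAssigned += demand;
        return true;
    }

    double total() {
        return totalAssigned;
    }
}
```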

bq. One comment: the current reservation comparison sorts by (allocatedResource - guaranteedResource);
my feeling is that this may let larger queues get resources more easily than smaller queues.
Is it possible for an app to get more resources than others by lying to the RM that it needs
more resources when there is fierce competition for resources?

To prevent exactly this, we do our allocations starting from the smallest to the largest
reservation queue. We also enforce sharing policies (YARN-1711) to prevent a single user/app
from reserving the entire cluster's resources or causing starvation by hoarding resources.
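One plausible way to combine the two orderings above, sketched with hypothetical names (the satisfaction comparison is from the patch discussion; using guaranteed capacity as the tie-break is my assumption for illustrating the smallest-first rule):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: reservations are compared by how satisfied they are
// (allocated - guaranteed), least satisfied first, and smaller reservation
// queues are visited before larger ones so big queues cannot crowd them out.
class OrderingSketch {
    record Reservation(String name, double guaranteed, double allocated) {
        double satisfaction() {
            return allocated - guaranteed;
        }
    }

    static List<Reservation> allocationOrder(List<Reservation> rs) {
        List<Reservation> sorted = new ArrayList<>(rs);
        sorted.sort(Comparator.comparingDouble(Reservation::satisfaction)
                              .thenComparingDouble(Reservation::guaranteed));
        return sorted;
    }
}
```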

Hope this clarifies the logic. Feel free to revert if you have any further questions.

> Admission Control: plan follower
> --------------------------------
>                 Key: YARN-1712
>                 URL: https://issues.apache.org/jira/browse/YARN-1712
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler, resourcemanager
>            Reporter: Carlo Curino
>            Assignee: Carlo Curino
>              Labels: reservations, scheduler
>         Attachments: YARN-1712.1.patch, YARN-1712.patch
> This JIRA tracks a thread that continuously propagates the current state of an inventory
subsystem to the scheduler. As the inventory subsystem stores the "plan" of how the resources
should be subdivided, the work we propose in this JIRA realizes such a plan by dynamically
instructing the CapacityScheduler to add/remove/resize queues to follow the plan.
