hadoop-yarn-issues mailing list archives

From "Carlo Curino (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-568) FairScheduler: support for work-preserving preemption
Date Sat, 04 May 2013 17:50:17 GMT

    [ https://issues.apache.org/jira/browse/YARN-568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649126#comment-13649126 ]

Carlo Curino commented on YARN-568:

Sandy, I agree with your summary of the FS mechanics, and you raise important questions that
I try to address below. 

The idea behind the preemption we are introducing is to preempt first and kill later, to allow
the AM to "save" its work before being killed (in the CS we go a step further and let the AM pick
the containers, but that is a bit trickier so I would leave it out for the time being). This
requires us to be "consistent" in how we pick the containers: first ask nicely, and then
kill the same containers if the AM is ignoring us or being too slow. This is needed to give
the AM a consistent view of the RM's needs. Assuming we are consistent in picking containers,
I think the simple mechanics we posted should be ok.

Now how can we get there:

1) This translates into a deterministic choice of containers across invocations of the preemption
procedure. Sorting by priority is a first step in that direction (although, as I commented
[here | https://issues.apache.org/jira/browse/YARN-569?focusedCommentId=13638825&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13638825],
there are some other issues with that). Adding reverse-container-ordering as a second-level sort
key might help guarantee that the picking order is consistent (this is missing now). In particular,
if the need for preemption is consistent over time, no new containers would be granted to this app,
so picking from the "tail" should yield a consistent set of containers (minus the ones naturally
expiring, which would be accounted for in future runs as a reduced preemption need). On the other
hand, if the cluster conditions change drastically enough (e.g., a big job finishes) and there is
no more need to kill some containers from this app, we save the cost of kill and reschedule. In a
sense, instead of looking at an instantaneous need for preemption every 15 seconds, we check every
5 seconds and only kill when there is a sustained need for a window of >maxWaitTimeBeforeKill. I
think that if we can get this to work as intended we would get a better overall policy (less jitter).
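
To make point 1 concrete, here is a minimal sketch of the ordering I have in mind. It is not
taken from the patch: the Container class, its fields, and pickVictims are hypothetical
stand-ins, and I am assuming that a larger priority value means "preempt earlier" and that the
"tail" means the most recently allocated containers. The only point is that a total order
(priority, then reverse allocation order) makes repeated invocations pick the same containers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PreemptionOrderSketch {

    /** Hypothetical stand-in for a running container (not YARN's RMContainer). */
    static final class Container {
        final String id;
        final int priority;       // assumption: larger value = preempt earlier
        final long allocationSeq; // assumption: increases with every allocation

        Container(String id, int priority, long allocationSeq) {
            this.id = id;
            this.priority = priority;
            this.allocationSeq = allocationSeq;
        }
    }

    /** First level: priority, descending. Second level: allocation order, newest first. */
    static final Comparator<Container> PREEMPTION_ORDER =
        Comparator.comparingInt((Container c) -> c.priority).reversed()
            .thenComparing(
                Comparator.comparingLong((Container c) -> c.allocationSeq).reversed());

    /** Returns the ids of the first toPreempt containers in the deterministic order. */
    static List<String> pickVictims(List<Container> running, int toPreempt) {
        List<Container> sorted = new ArrayList<>(running); // do not reorder the caller's list
        sorted.sort(PREEMPTION_ORDER);
        int n = Math.max(0, Math.min(toPreempt, sorted.size()));
        List<String> ids = new ArrayList<>();
        for (Container c : sorted.subList(0, n)) {
            ids.add(c.id);
        }
        return ids;
    }
}
```

Because the order does not depend on how the input list happens to be arranged, two consecutive
runs over an unchanged set of containers select the same victims, which is what lets us ask
first and later kill exactly the containers we asked about.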

2) toPreempt is decremented in all three cases because we would otherwise double-kill for
the same resource needs. Imagine you want 5 containers and send the corresponding preemption
requests; while the AMs are working on preemption, the preemption procedure is called again and
re-detects that we want 5 containers back. If you don't account for the pending requests (i.e.,
decrement toPreempt for those too) you would pick (preempt or kill) another 5 containers (depending
on the time constants this could happen more than twice)... now we are forcing the AMs to release
10 (or more) containers for a 5-container preemption need. Anyway, I agree that once we converge
on this we should comment it clearly in the code; this seems the kind of code that people
would try to "fix" :-). The shift you spotted with this comment is from running "rarely
enough" that all the actions initiated during a previous run are fully reflected in the
current cluster state, to running frequently enough that the actions we are taking might not be
visible yet. This forces us to do some more bookkeeping and have robust heuristics, but I think
it is worth the improvement in the scheduler behavior.
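
The accounting in point 2 can be sketched as follows. This is a simplified illustration, not
the code in the patch: the class, the map of pending requests, and the runPass method are
hypothetical, containers are reduced to string ids, and I am reading the "three cases" as
(a) newly asked, (b) already asked and still within the grace period, and (c) asked and past
maxWaitTimeBeforeKill. The candidates are assumed to arrive in the same order on every pass.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PreemptionAccountingSketch {
    private final long maxWaitTimeBeforeKillMs;
    // Hypothetical bookkeeping: container id -> time we first asked for it back.
    private final Map<String, Long> firstAskedAtMs = new HashMap<>();

    final List<String> preemptionRequests = new ArrayList<>(); // "ask nicely" log
    final List<String> kills = new ArrayList<>();              // hard-kill log

    PreemptionAccountingSketch(long maxWaitTimeBeforeKillMs) {
        this.maxWaitTimeBeforeKillMs = maxWaitTimeBeforeKillMs;
    }

    /** One invocation of the preemption procedure over a consistently ordered candidate list. */
    void runPass(List<String> candidatesInOrder, int toPreempt, long nowMs) {
        for (String id : candidatesInOrder) {
            if (toPreempt <= 0) {
                break; // need is covered by containers already asked for or killed
            }
            Long askedAt = firstAskedAtMs.get(id);
            if (askedAt == null) {
                // Case (a): first time we pick this container, so ask the AM nicely.
                firstAskedAtMs.put(id, nowMs);
                preemptionRequests.add(id);
            } else if (nowMs - askedAt >= maxWaitTimeBeforeKillMs) {
                // Case (c): the AM had its chance, so kill the same container we asked about.
                firstAskedAtMs.remove(id);
                kills.add(id);
            }
            // Case (b): request still pending, nothing to send.
            // Decrement in ALL three cases: a pending request already covers part of the
            // need, and skipping the decrement would select additional containers.
            toPreempt--;
        }
    }
}
```

With a need of 5 containers and 7 candidates, running this at 0, 5 and 10 seconds (with a
10-second maxWaitTimeBeforeKill) asks for the first 5 once, does nothing on the second pass, and
kills those same 5 on the third; the remaining 2 candidates are never touched. Moving the
decrement inside case (a) only is exactly the "fix" that would reintroduce the double-kill.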

3) It is probably good to have a "no-preemption" mode in which we simply straight kill. However,
by setting the time constants right (e.g., preemptionInterval to 5sec and maxWaitTimeBeforeKill
to 10sec) you would get the same effect of having a hard kill at most 15sec after there is
a need for preemption, but for every preemption-aware AM we could save the progress made so
far. In our current MR implementation of preemption, you might get containers back even faster,
as we release containers once we are done checkpointing. Note that since we are not actually
killing at every preemptionInterval, we could set that very low (if the performance of the FS
allows it) and get more points of observation and faster reaction times, while
maxWaitTimeBeforeKill would be tuned as a tradeoff between giving the AM enough time to preempt
and speed of rebalance.
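
As a back-of-the-envelope check of the 15sec figure in point 3, here is a small sketch of the
worst-case delay between a need for preemption appearing and the hard kill. It rests on
assumptions that are mine and not necessarily true of the patch: the need is only noticed at
the next periodic check, kills are only issued during a check, and a kill fires once at least
maxWaitTimeBeforeKill has elapsed since the request. The helper name is hypothetical.

```java
final class PreemptionTimingSketch {

    /**
     * Worst-case time from "need appears" to "container killed", in milliseconds,
     * assuming both detection and kills happen only on preemptionInterval ticks.
     */
    static long worstCaseKillDelayMs(long preemptionIntervalMs, long maxWaitTimeBeforeKillMs) {
        if (preemptionIntervalMs <= 0 || maxWaitTimeBeforeKillMs < 0) {
            throw new IllegalArgumentException("interval must be > 0 and wait must be >= 0");
        }
        // Up to one full interval passes before the need is noticed and the AM is asked.
        long detectionDelayMs = preemptionIntervalMs;
        // The kill happens at the first later check at which the wait has fully elapsed.
        long checksToWait =
            (maxWaitTimeBeforeKillMs + preemptionIntervalMs - 1) / preemptionIntervalMs;
        return detectionDelayMs + checksToWait * preemptionIntervalMs;
    }

    public static void main(String[] args) {
        // The example from the text: check every 5 seconds, wait 10 seconds before killing.
        System.out.println(worstCaseKillDelayMs(5_000L, 10_000L) + " ms");
    }
}
```

Under these assumptions a shorter preemptionInterval shrinks the detection term and the rounding
of the wait, while maxWaitTimeBeforeKill remains the knob that trades AM checkpointing time
against speed of rebalance.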

I will look into adding the allocation-order as a second-level ordering for containers. Please
let me know whether this seems enough or I am missing something.

> FairScheduler: support for work-preserving preemption 
> ------------------------------------------------------
>                 Key: YARN-568
>                 URL: https://issues.apache.org/jira/browse/YARN-568
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: scheduler
>            Reporter: Carlo Curino
>            Assignee: Carlo Curino
>         Attachments: YARN-568.patch, YARN-568.patch
> In the attached patch, we modified the FairScheduler to substitute its preemption-by-killing
with a work-preserving version of preemption (followed by killing if the AMs do not respond
quickly enough). This should allow us to run preemption checking more often, but kill less often
(proper tuning to be investigated). Depends on YARN-567 and YARN-45, is related to YARN-569.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
