hadoop-mapreduce-issues mailing list archives

From "Matei Zaharia (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-2205) FairScheduler should only preempt tasks for pools/jobs that are up next for scheduling
Date Tue, 30 Nov 2010 03:02:11 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965079#action_12965079 ]

Matei Zaharia commented on MAPREDUCE-2205:

I'll have to look at this code in trunk more carefully, but I think you're right that cases
can arise in which the wrong job is scheduled. The main one I notice is when several jobs
are below their min or fair share and one of them times out. When this job times out, the
FairShareComparator looks at all the jobs in order of min or fair share ratio and picks the
one with the lowest ratio to launch. However, that job may itself have been launched quite a
bit later than the one that timed out, so the timed out job needs to wait even longer. The
issue is going to be worse if some pools have preemption enabled and some don't, or if pools
have different preemption timeouts.
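
To make the mismatch concrete, here is a minimal sketch of the situation described above. It is not the FairScheduler code: the Job class, its fields, and both helper methods are hypothetical stand-ins, and the share-ratio comparator only approximates what FairShareComparator does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ShareOrderMismatch {
    /** Hypothetical job record; not a real Hadoop class. */
    static final class Job {
        final String name;
        final double shareRatio;    // running tasks / min or fair share (lower = needier)
        final long starvedSinceMs;  // time at which the job fell below its share

        Job(String name, double shareRatio, long starvedSinceMs) {
            this.name = name;
            this.shareRatio = shareRatio;
            this.starvedSinceMs = starvedSinceMs;
        }
    }

    /** Assumed stand-in for a share-ratio ordering: lowest ratio first. */
    static final Comparator<Job> BY_SHARE_RATIO =
        Comparator.comparingDouble((Job j) -> j.shareRatio);

    /** The job a pure share-ratio ordering would launch next. */
    static Job nextByShareRatio(List<Job> jobs) {
        List<Job> sorted = new ArrayList<>(jobs);
        sorted.sort(BY_SHARE_RATIO);
        return sorted.get(0);
    }

    /** Jobs that have been starved for at least timeoutMs as of nowMs. */
    static List<Job> pastTimeout(List<Job> jobs, long nowMs, long timeoutMs) {
        List<Job> result = new ArrayList<>();
        for (Job j : jobs) {
            if (nowMs - j.starvedSinceMs >= timeoutMs) {
                result.add(j);
            }
        }
        return result;
    }
}
```

With job A (ratio 0.5, starved since t=0) and job B (ratio 0.1, starved since t=50s), a 30s timeout evaluated at t=60s marks only A as timed out, yet the share-ratio ordering hands the next slot to B.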

I think fixing this might require a slightly different approach than you proposed because
of the semantics we want for timeouts. We want preemption to occur only if a job (or pool)
has not been serviced for X seconds, in which case the preempted resources should go to that
job. If we just sort the jobs using FairShareComparator, we may miss the fact that a job later
in the fair share order has actually timed out and requires preemption now. Instead, it would
be better to change FairShareComparator (and its equivalent in Facebook's 0.20) to sort jobs
by whether they are past their preemption timeout first. We should also think about whether
it's best to do this in the comparator or outside of it. I think one of the cleaner solutions
would be to do this prioritization outside the comparator (i.e. sort the jobs and then pull
out the starved ones), because this way, we don't need to modify all the comparators to take
into account preemption timeouts.

So in summary, I'd propose the following approach:
* In the FairScheduler object, keep track of which pools are currently starved and past their
preemption timeouts. This could be as simple as calling tasksToPreempt() on each heartbeat
or more complicated if we want to cache this value somehow.
* In FairScheduler.assignTasks, subdivide the pools into timed-out and non-timed-out ones
and prioritize assigning tasks to the former. We can still use the FairShareComparator to
sort the pools of each type. At the end of the day, all pools should be put into a global
order and the assignTasks method can proceed as normal.
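
A rough sketch of the ordering step in the second bullet, under stated assumptions: PoolInfo, its pastTimeout flag, and globalOrder are hypothetical names introduced for illustration, and BY_SHARE_RATIO merely stands in for FairShareComparator. How the flag gets computed (for example from tasksToPreempt()) is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class StarvedFirstOrdering {
    /** Hypothetical view of a pool; not the real Hadoop Pool class. */
    static final class PoolInfo {
        final String name;
        final double shareRatio;    // lower = further below its share
        final boolean pastTimeout;  // starved and past its preemption timeout

        PoolInfo(String name, double shareRatio, boolean pastTimeout) {
            this.name = name;
            this.shareRatio = shareRatio;
            this.pastTimeout = pastTimeout;
        }
    }

    /** Assumed stand-in for FairShareComparator: lowest share ratio first. */
    static final Comparator<PoolInfo> BY_SHARE_RATIO =
        Comparator.comparingDouble((PoolInfo p) -> p.shareRatio);

    /**
     * Builds one global order: timed-out pools first, then the rest,
     * with each group sorted by the unmodified comparator.
     */
    static List<PoolInfo> globalOrder(List<PoolInfo> pools, Comparator<PoolInfo> cmp) {
        List<PoolInfo> timedOut = new ArrayList<>();
        List<PoolInfo> rest = new ArrayList<>();
        for (PoolInfo p : pools) {
            if (p.pastTimeout) {
                timedOut.add(p);
            } else {
                rest.add(p);
            }
        }
        timedOut.sort(cmp);
        rest.sort(cmp);
        List<PoolInfo> order = new ArrayList<>(timedOut);
        order.addAll(rest);
        return order;
    }
}
```

Because the partitioning happens outside the comparator, any existing comparator can be passed in unchanged, which is the point of doing the prioritization outside of it.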

> FairScheduler should only preempt tasks for pools/jobs that are up next for scheduling
> --------------------------------------------------------------------------------------
>                 Key: MAPREDUCE-2205
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2205
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/fair-share
>            Reporter: Joydeep Sen Sarma
> We have hit a problem with the preemption implementation in the FairScheduler where the
following happens:
> # job X runs short of fair share or min share and requests/causes N tasks to be preempted
> # when slots are then scheduled - tasks from some other job are actually scheduled
> # after preemption_interval has passed, job X finds it's still underscheduled and requests
preemption. goto 1.
> This has caused widespread preemption of tasks and the cluster going from high utilization
to low utilization in a few minutes.
> Some of the problems are specific to our internal version of hadoop (still 0.20 and doesn't
have the hierarchical FairScheduler) - but i think the issue here is generic (just took a
look at the trunk assignTasks and tasksToPreempt routines). The basic problem seems to be
that the logic of assignTasks+FairShareComparator is not consistent with the logic in tasksToPreempt().
The latter can choose to preempt tasks on behalf of jobs that may not be first up for scheduling
based on the FairShareComparator. Understanding whether these two separate pieces of logic are
consistent and keeping it that way is difficult.
> It seems that a much safer preemption implementation is to walk the jobs in the order
they would be scheduled on the next heartbeat - and only preempt for jobs that are at the
head of this sorted queue. In MAPREDUCE-2048 - we have already introduced a pre-sorted list
of jobs ordered by current scheduling priority. It seems much easier to preempt only jobs
at the head of this sorted list.

