hadoop-common-dev mailing list archives

From "Matei Zaharia (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-4665) Add preemption to the fair scheduler
Date Wed, 01 Apr 2009 18:22:13 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694663#action_12694663 ]

Matei Zaharia edited comment on HADOOP-4665 at 4/1/09 11:21 AM:
----------------------------------------------------------------

Hi Vinod,

A few comments:

* Are the 2-3 heartbeats for preemption really necessary? I imagine timeouts will be on the
order of minutes, so a few seconds won't make a big difference. Although thinking of the timeout
as an SLA is nice, it's equally easy to think of it as "this is when it can start killing
tasks". To me, putting in this extra 2-3 heartbeats thing seems like unnecessary complexity.
* The reason the preemption-enabled check is deep inside the method is to give you the ability
to turn preemption off but still see the SHOULD_PREEMPT log messages, so you can figure out when
your cluster *would* preempt tasks given certain settings. We wanted this at Facebook so that we
could add some timeouts, count the SHOULD_PREEMPT messages over a week, and be sure that the
settings chosen are good without actually losing a lot of tasks if there's a mistake. I think this
is a good feature to keep for other people who are thinking of turning on preemption.
* There actually is a way to set a default preemption timeout as in Joydeep's comment - you
can set defaultMinSharePreemptionTimeout in the XML file. The code for this is in PoolManager.
* The default settings of preemptionEnabled=true and no timeouts are to make preemption easy
to configure gradually. We expect that most people will start out not wanting preemption,
because it creates an extra worry of "have we set it too low". Then as people start running
pools with "production" jobs (with min share set), they may want to enable preemption just
for these jobs. They would be able to do that by just adding a preemptionTimeout entry to
those pools in the config, and it would be active without needing to restart the JobTracker.
Then if they see non-production jobs suffering, they could enable the fairSharePreemptionTimeout,
again without requiring a cluster restart. The only reason to also provide a preemptionEnabled
setting in the jobconf is for the testing purpose I mentioned above, where an organization
switching over to preemption in production can figure out first whether it will kill too many
tasks. Overall, my goal with all the fair scheduler config is to make it as easy as possible
to use "out of the box". You don't need to define pools in advance, you don't need to define
min shares or weights in advance, you don't need to decide when to use preemption in advance,
etc., and the only setting you need in mapred-site.xml is the one that sets Hadoop to use the
fair scheduler. Then as you decide you want the more advanced features, you enable them gradually.
I actually think there are strong advantages to this over your proposal of having preemptionEnabled=false
and having non-infinite default timeouts, so again I'd like more motivation before making such
a large code change. The other factor is that Facebook has been using the current version
of the preemption code and has found the current features useful.
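
To make the log-only behaviour in the second and fourth points concrete, here is a minimal Java sketch. It is not taken from the patch: the class PreemptionDryRun, its fields, the NO_TIMEOUT constant and the log format are all hypothetical names chosen for illustration. It only assumes what the comment states, namely that a pool with no timeout is never preempted for, that SHOULD_PREEMPT is logged whenever a timeout expires, and that the enabled flag is consulted last.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the HADOOP-4665 patch. */
final class PreemptionDryRun {
    /** Stand-in for "no timeout configured": the pool never becomes eligible. */
    static final long NO_TIMEOUT = Long.MAX_VALUE;

    final boolean preemptionEnabled;
    /** Stands in for the scheduler log. */
    final List<String> log = new ArrayList<>();
    /** Pools for which tasks would really have been killed. */
    final List<String> preemptedFor = new ArrayList<>();

    PreemptionDryRun(boolean preemptionEnabled) {
        this.preemptionEnabled = preemptionEnabled;
    }

    /**
     * @param starvedForMs how long the pool has been below its share
     * @param timeoutMs    the pool's preemption timeout, or NO_TIMEOUT
     * @return true if the timeout has expired, whether or not anything was killed
     */
    boolean checkPool(String pool, long starvedForMs, long timeoutMs) {
        if (timeoutMs == NO_TIMEOUT || starvedForMs < timeoutMs) {
            // The out-of-the-box configuration ends up here: nothing logged, nothing killed.
            return false;
        }
        // Logged unconditionally, so operators can count would-be preemptions.
        log.add("SHOULD_PREEMPT pool=" + pool + " starvedForMs=" + starvedForMs);
        // The enabled check comes last, so switching it off only suppresses the kill.
        if (preemptionEnabled) {
            preemptedFor.add(pool);
        }
        return true;
    }
}
```

With preemptionEnabled=false, a cluster in this sketch still records one SHOULD_PREEMPT line per expired timeout, which is the number you would count over a week before switching preemption on.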

I'll take a look at your other comments later this week. Regarding the code reuse in preemptTasks,
it is indeed based on the one in the capacity scheduler, but I'd like to make refactoring
that a separate issue from this JIRA. The right thing might be to have some of that functionality
in TaskScheduler.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, hadoop-4665-v1b.patch,
>                      hadoop-4665-v2.patch, hadoop-4665-v3.patch, hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might
> submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may
> be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has
> a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance
> for interactive jobs on average through fair sharing. Therefore, it will support preempting
> under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds (e.g. less
> than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees
> quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed
> capacity are being starved.
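
The two conditions can be written as a pair of predicates. This is only a sketch under stated assumptions: the class and method names are hypothetical, shares are counted in task slots, and "significantly less" is taken to be the "less than half" example from the description rather than a confirmed constant.

```java
/** Hypothetical sketch of the two preemption conditions; not the patch's code. */
final class StarvationCheck {
    /** Condition 1: below the guaranteed (min) share for at least t1Ms. */
    static boolean starvedForMinShare(int runningTasks, int minShare,
                                      long belowMinShareMs, long t1Ms) {
        return runningTasks < minShare && belowMinShareMs >= t1Ms;
    }

    /**
     * Condition 2: below half of the fair share for at least t2Ms.
     * The factor 0.5 is the example threshold from the issue text.
     */
    static boolean starvedForFairShare(int runningTasks, double fairShare,
                                       long belowHalfFairShareMs, long t2Ms) {
        return runningTasks < fairShare * 0.5 && belowHalfFairShareMs >= t2Ms;
    }

    /** A job qualifies for preemption if either condition holds. */
    static boolean shouldPreempt(int runningTasks, int minShare, double fairShare,
                                 long belowMinShareMs, long belowHalfFairShareMs,
                                 long t1Ms, long t2Ms) {
        return starvedForMinShare(runningTasks, minShare, belowMinShareMs, t1Ms)
            || starvedForFairShare(runningTasks, fairShare, belowHalfFairShareMs, t2Ms);
    }
}
```

Keeping the two timers separate is what lets T1 be short for queues with guarantees while T2 stays long as a last resort.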
> When deciding which tasks to kill to make room for the job, we will use the following
> heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, ordering
> these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the copy phase
> where there is not much map output per task (at Facebook, we have observed this to be the
> main time we need preemption - when a job has a long map phase and its reducers are mostly
> sitting idle and filling up slots).
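
The first two heuristics above can be sketched as follows. All names here (VictimSelection, Job, Task, chooseVictims) are hypothetical, and two simplifications are assumptions of the sketch rather than statements from the issue: a job is never pushed below its fair share, and the extra preference for copy-phase reduces is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of victim selection; not the patch's preemptTasks. */
final class VictimSelection {
    static final class Task {
        final String id;
        final long runningForMs;
        Task(String id, long runningForMs) {
            this.id = id;
            this.runningForMs = runningForMs;
        }
    }

    static final class Job {
        final String name;
        final double fairShare;  // in task slots, assumed non-negative
        final List<Task> tasks;  // currently running tasks
        Job(String name, double fairShare, List<Task> tasks) {
            this.name = name;
            this.fairShare = fairShare;
            this.tasks = tasks;
        }
        double overScheduledBy() {
            return tasks.size() - fairShare;
        }
    }

    /** Returns the ids of up to slotsNeeded tasks to kill, in kill order. */
    static List<String> chooseVictims(List<Job> jobs, int slotsNeeded) {
        // Only jobs above their fair share are candidates.
        List<Job> over = new ArrayList<>();
        for (Job job : jobs) {
            if (job.overScheduledBy() > 0) {
                over.add(job);
            }
        }
        // Most over-scheduled jobs first.
        over.sort((a, b) -> Double.compare(b.overScheduledBy(), a.overScheduledBy()));

        List<String> victims = new ArrayList<>();
        for (Job job : over) {
            // Within a job, prefer the tasks that have run for the least time.
            List<Task> youngestFirst = new ArrayList<>(job.tasks);
            youngestFirst.sort((a, b) -> Long.compare(a.runningForMs, b.runningForMs));
            // Assumption: take only the surplus, so the job keeps its fair share.
            int spare = (int) Math.floor(job.overScheduledBy());
            for (int i = 0; i < spare && victims.size() < slotsNeeded; i++) {
                victims.add(youngestFirst.get(i).id);
            }
            if (victims.size() >= slotsNeeded) {
                break;
            }
        }
        return victims;
    }
}
```

Killing the youngest tasks first bounds the wasted work per freed slot by the run time of the newest tasks in the most over-scheduled job.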

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

