hadoop-common-dev mailing list archives

From "Matei Zaharia (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-4667) Global scheduling in the Fair Scheduler
Date Sun, 25 Jan 2009 08:56:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667061#action_12667061 ]

matei edited comment on HADOOP-4667 at 1/25/09 12:54 AM:
-----------------------------------------------------------------

Having some kind of SchedulingInfo data structure associated with each job might be a good
start towards separating scheduling stuff from non-scheduling stuff and ultimately having
a common scheduler codebase. It could be a field within the JobInProgress, and maybe schedulers
would be allowed to extend the base class to have their own info attributes. Is this the kind
of thing you're proposing?
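For illustration, the kind of structure I have in mind might look like this. This is only a sketch; SchedulingInfo, FairSchedulerInfo, and the schedulingInfo field are placeholder names, not existing Hadoop API:

```java
// Hypothetical sketch: a base SchedulingInfo attached to each
// JobInProgress, which a scheduler may subclass to carry its own state.
class SchedulingInfo {
    long startTime;                      // example shared attribute
}

class FairSchedulerInfo extends SchedulingInfo {
    double deficit;                      // fair-scheduler-specific state
}

class JobInProgress {
    // Field holding the (possibly scheduler-specific) info object.
    SchedulingInfo schedulingInfo = new SchedulingInfo();
}
```

A scheduler would then swap in its own subclass when the job is initialized and downcast when it needs its private attributes.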

I'm still somewhat wary about making obtainNewMapTask sometimes not return tasks due to this
scheduling opportunity stuff though. It seems like the more things we add to it, the harder
it will be to break the cycle and switch to a saner API (hasMapTask and createMapTask for
example). Furthermore, once a technique like this is in the JobInProgress class, it's hard
to try out other methods for achieving the same goal. One of the nicer things about having
the scheduler API is that although it makes the codebase more fragmented, it's enabled us
to experiment with stuff like this. As a concrete example, once this basic patch is finished,
I want to try a refinement for dealing with hotspot nodes that will launch IO-intensive tasks
preferentially on those nodes to maximize the rate of local IO. Would it make sense to be
working in JobInProgress for that? So I'd prefer a way to provide this functionality to
all the schedulers without ingraining it in JobInProgress, or at least without blocking the
road towards future changes to this policy. Perhaps obtainNewMapTask could be split up, without
significant code changes, into something generic that everyone can use (give me a task now),
something smarter (count attempts and handle the wait for me), and perhaps even a version
that just says whether there is a task at the given locality level without initializing it.
Does that make sense, or do you think it's better to leave the API exactly the same and do
this by default?
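One way to read that split, as a sketch only: the method names hasMapTask and createMapTask come from the discussion above, but the signatures and the MapTask stub here are invented for illustration.

```java
// Stub task type, invented for this sketch.
class MapTask {
    final int localityLevel;
    MapTask(int localityLevel) { this.localityLevel = localityLevel; }
}

// Hypothetical split of obtainNewMapTask into a cheap query and a factory.
interface MapTaskSource {
    // Query: is there an uninitialized task at this locality level?
    boolean hasMapTask(int localityLevel);
    // Factory: initialize and hand out such a task, or null if none.
    MapTask createMapTask(int localityLevel);
}

// Trivial implementation backed by a per-level count of pending tasks.
class CountingTaskSource implements MapTaskSource {
    private final int[] pending;          // pending[level] = tasks left
    CountingTaskSource(int[] pending) { this.pending = pending; }

    public boolean hasMapTask(int level) { return pending[level] > 0; }

    public MapTask createMapTask(int level) {
        if (!hasMapTask(level)) return null;
        pending[level]--;
        return new MapTask(level);
    }
}
```

The point is that a scheduler experimenting with a wait policy could call hasMapTask repeatedly without paying for task initialization, while simpler schedulers keep using the createMapTask path.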

> Global scheduling in the Fair Scheduler
> ---------------------------------------
>
>                 Key: HADOOP-4667
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4667
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-global-v0.patch, HADOOP-4667_api.patch
>
>
> The current schedulers in Hadoop all examine a single job on every heartbeat when choosing
which tasks to assign, choosing the job based on FIFO or fair sharing. There are inherent
limitations to this approach. For example, if the job at the front of the queue is small (e.g.
10 maps, in a cluster of 100 nodes), then on average it will launch only one local map on
the first 10 heartbeats while it is at the head of the queue. This leads to very poor locality
for small jobs. Instead, we need a more "global" view of scheduling that can look at multiple
jobs. To resolve the locality problem, we will use the following algorithm:
> - If the job at the head of the queue has no node-local task to launch, skip it and look
through other jobs.
> - If a job has waited at least T1 seconds while being skipped, also allow it to launch
rack-local tasks.
> - If a job has waited at least T2 > T1 seconds, also allow it to launch off-rack tasks.
> This algorithm improves locality while bounding the delay that any job experiences in
launching a task.
> It turns out that whether waiting is useful depends on how many tasks are left in the
job (which determines the probability of getting a heartbeat from a node with a local task)
and on whether the job is CPU or IO bound. Thus there may be logic for removing the wait on
the last few tasks in the job.
> As a related issue, once we allow global scheduling, we can launch multiple tasks per
heartbeat, as in HADOOP-3136. The initial implementation of HADOOP-3136 adversely affected
performance because it only launched multiple tasks from the same job, but with the wait rule
above, we will only do this for jobs that are allowed to launch non-local tasks.
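The locality-wait rules in the description above can be sketched as follows. This is a simplified illustration only; the T1/T2 thresholds and the method name are placeholders for whatever the patch actually uses:

```java
// Simplified sketch of the locality-wait policy described in the issue:
// a job starts restricted to node-local tasks, and relaxes to rack-local
// after waiting T1, then to off-rack after waiting T2 > T1.
class LocalityWaitPolicy {
    static final int NODE_LOCAL = 0, RACK_LOCAL = 1, OFF_RACK = 2;

    private final long t1Millis, t2Millis;   // thresholds, T1 < T2

    LocalityWaitPolicy(long t1Millis, long t2Millis) {
        this.t1Millis = t1Millis;
        this.t2Millis = t2Millis;
    }

    // Most relaxed locality level a job may launch at, given how long
    // it has been skipped while at the head of the queue.
    int allowedLevel(long waitedMillis) {
        if (waitedMillis >= t2Millis) return OFF_RACK;
        if (waitedMillis >= t1Millis) return RACK_LOCAL;
        return NODE_LOCAL;
    }
}
```

Because the allowed level only ever relaxes with time, the delay any job experiences before launching some task is bounded by T2, which is the property claimed above.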

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

