hadoop-mapreduce-dev mailing list archives

From Robert Evans <ev...@yahoo-inc.com>
Subject Re: Multiple resource requests for a given node (or all nodes)?
Date Mon, 12 Dec 2011 16:44:00 GMT
I think the way requests are made to the scheduler may need a bigger redesign, because the only
real use case at the time it was designed was map/reduce.  It works very well for that purpose
but misses a few other use cases.  For example, something like HBase wants a specific number of
nodes with no overlap on the same physical machines (yes, you can do it now, but it may take
many iterations to get it right).  Or take MPI or Storm, which don't really care where the nodes
are so long as they are all relatively close to one another in the network topology.  Or take
MPI again, which cannot start any processing until all of its containers are ready (gang
scheduling).
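
To make those three cases concrete, here is a rough sketch of what a richer request might
carry.  None of this is an existing YARN API; GroupRequest, Placement, Slot and acceptable()
are names made up for illustration, and the check is deliberately simplistic.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical request for a group of containers with a placement rule. */
class GroupRequest {
    /** Illustrative placement rules; not part of any real scheduler API. */
    enum Placement { ANYWHERE, DISTINCT_HOSTS, SAME_RACK }

    /** A container slot the scheduler might offer (hypothetical type). */
    static final class Slot {
        final String host;
        final String rack;
        Slot(String host, String rack) {
            this.host = host;
            this.rack = rack;
        }
    }

    final int containers;        // how many containers the application wants
    final Placement placement;   // how they must relate to each other
    final boolean allOrNothing;  // true for gang scheduling (the MPI case)

    GroupRequest(int containers, Placement placement, boolean allOrNothing) {
        this.containers = containers;
        this.placement = placement;
        this.allOrNothing = allOrNothing;
    }

    /** Returns true if the offered slots could be handed to the application. */
    boolean acceptable(List<Slot> offered) {
        // A gang-scheduled request is useless until every container is available.
        if (allOrNothing && offered.size() < containers) {
            return false;
        }
        Set<String> hosts = new HashSet<>();
        Set<String> racks = new HashSet<>();
        for (Slot s : offered) {
            hosts.add(s.host);
            racks.add(s.rack);
        }
        switch (placement) {
            case DISTINCT_HOSTS:
                // The HBase case: no two containers on the same machine.
                return hosts.size() == offered.size();
            case SAME_RACK:
                // A crude stand-in for "close in the network topology".
                return racks.size() <= 1;
            default:
                return true;
        }
    }
}
```

With something like this the scheduler could reject a bad placement in one pass instead of
the application discovering it over many iterations.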

It gets even more complicated if we want to support preemption, as the fair scheduler does.
IMO that is needed even more once MPI and other potentially very long-lived jobs start to
coexist with shorter jobs with tight SLAs.  To make a good decision about what to preempt, the
scheduler needs to know that if it preempts a mapper, even though it may have been running for
a much shorter time than some reducer in the same application, it is likely to slow things down
more than if it preempts that reducer.  Or if it preempts an MPI node, it might as well kill
the entire application and start over, unless we somehow give the scheduler the ability to tell
MPI that it is going to be preempted and needs to save its state away.  But even then the
scheduler needs to know that preempting an MPI job will cause all progress on it, and on all of
the containers it is holding, to stop.
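
As a rough illustration of the MPI point only, here is a sketch of how the cost of a
preemption could differ by application type.  The classes and the lostWork() helper are
hypothetical, and measuring cost as container runtime is an assumption; as noted above,
runtime alone does not capture the mapper versus reducer case.

```java
import java.util.List;

/** Hypothetical model of the work thrown away by preempting one container. */
class PreemptionCost {
    /** A running container and how long it has been running (illustrative). */
    static final class Container {
        final String id;
        final long runtimeSeconds;
        Container(String id, long runtimeSeconds) {
            this.id = id;
            this.runtimeSeconds = runtimeSeconds;
        }
    }

    /** An application; gangScheduled marks MPI-style jobs. */
    static final class App {
        final boolean gangScheduled;
        final List<Container> containers;
        App(boolean gangScheduled, List<Container> containers) {
            this.gangScheduled = gangScheduled;
            this.containers = containers;
        }
    }

    /** Container-seconds lost if the victim is preempted without a checkpoint. */
    static long lostWork(App app, Container victim) {
        if (!app.gangScheduled) {
            // Independent tasks: only the victim's own work is lost.
            return victim.runtimeSeconds;
        }
        // Gang-scheduled job: losing one container stalls or restarts them all.
        long total = 0;
        for (Container c : app.containers) {
            total += c.runtimeSeconds;
        }
        return total;
    }
}
```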

Even if we are not putting any of these scheduling features in now, we need to think about
them when designing the interface, so that we do not limit ourselves and force drastic changes
later on.  I am just saying that I am not sure switching to a multimap is enough.

Bobby Evans

On 12/10/11 6:21 PM, "Todd Lipcon" <todd@cloudera.com> wrote:

On Sat, Dec 10, 2011 at 12:23 PM, Patrick Wendell
<pwendell@eecs.berkeley.edu> wrote:

> What happens if an application wants to request multiple container
> types on a given node? E.g. say I need 10 2GB containers and 10 1GB
> containers, and I don't care which node they are on (i.e. RMNode.ANY).
> I really want to store 2 resource requests under RMNode.ANY in this
> case... don't I?
> Is the model just that an AM would ask for these in series?

My hunch is that this was overlooked because the resource sizes for MR
are basically set on a per-task-type level. That is, maps need X MB
and reduces need Y MB. Since maps and reduces are set at different
'priorities', they haven't conflicted.

Does it seem straightforward to change it to a multimap? Guava has a
nice implementation.
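
For reference, a minimal sketch of the multimap idea, using only the JDK so it stands alone
(Guava's ArrayListMultimap would fill the same role).  RequestTable, Request and the ANY
key are hypothetical stand-ins, not the actual ResourceManager classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical request table where one location can hold several requests. */
class RequestTable {
    /** Stand-in for the "any node" key (RMNode.ANY in the question above). */
    static final String ANY = "*";

    /** Hypothetical request: container size in MB and how many are wanted. */
    static final class Request {
        final int memoryMb;
        final int count;
        Request(int memoryMb, int count) {
            this.memoryMb = memoryMb;
            this.count = count;
        }
    }

    // A multimap emulated as a map from location to a list of requests.
    private final Map<String, List<Request>> byLocation = new HashMap<>();

    /** Adds a request without replacing earlier ones for the same location. */
    void add(String location, Request request) {
        byLocation.computeIfAbsent(location, k -> new ArrayList<>()).add(request);
    }

    /** Returns a read-only view of all requests stored for the location. */
    List<Request> get(String location) {
        List<Request> found = byLocation.get(location);
        if (found == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(found);
    }

    /** Total memory in MB asked for at the location, across all requests. */
    int totalMemoryMb(String location) {
        int total = 0;
        for (Request r : get(location)) {
            total += r.memoryMb * r.count;
        }
        return total;
    }
}
```

Patrick's example then becomes two add() calls under ANY, one for the ten 2GB containers
and one for the ten 1GB containers, and both survive side by side.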

Todd Lipcon
Software Engineer, Cloudera
