hadoop-common-user mailing list archives

From "Paul Sutter" <sut...@gmail.com>
Subject Re: Task type priorities during scheduling ?
Date Mon, 24 Jul 2006 10:28:54 GMT

it doesn't matter how the code is structured; what does matter is that
the reduce phase and shuffle phase have very different timelines and
resource requirements, and should not both be charged against the
number of reduce tasks permitted.

it should be possible to have lots of tasks in the shuffle phase
(mostly, sitting around waiting for mappers to run), but only have
"about" one actual reduce phase running per cpu (or whatever works for
each of our apps) that gets enough memory for a sorter, does
substantial computation, etc.

maybe that's what you meant, and if so, apologies; just wanted to be clear.

i'm sure that can be done with a single task/thread that does both
phases, and that's probably the simplest way to code it.
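
to make the idea concrete, here is a rough sketch (not Hadoop code; the
class name, method name, and sleep times are all made up for
illustration). each task is one thread that does its shuffle and then
its reduce, but a semaphore caps how many threads can be in the reduce
phase at once, so lots of shuffles can wait around cheaply:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Hadoop code: every name here is invented.
class PhaseGatedReducers {

    // Runs `tasks` threads; each does a shuffle phase, then a reduce phase.
    // At most `reduceSlots` threads may be in the reduce phase at once.
    // Returns the highest number of concurrent reduce phases observed.
    static int runAll(int tasks, int reduceSlots) throws InterruptedException {
        Semaphore reducePermits = new Semaphore(reduceSlots);
        AtomicInteger reducing = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(tasks);

        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    // Shuffle phase: not gated; mostly waiting (simulated by sleep).
                    Thread.sleep(20);

                    // Reduce phase: gated, because it needs memory and CPU.
                    reducePermits.acquire();
                    try {
                        int now = reducing.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(10); // stand-in for sort + reduce work
                    } finally {
                        reducing.decrementAndGet();
                        reducePermits.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 8 tasks may shuffle at once, but only 2 may reduce at once.
        System.out.println("peak concurrent reduce phases: " + runAll(8, 2));
    }
}
```

the point of the sketch is only that the limit applies to the reduce
phase, not to the task as a whole; how a real scheduler would account
for it is a separate question.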


On 7/24/06, Doug Cutting <cutting@apache.org> wrote:
> Eric Baldeschwieler wrote:
> > Of course interleaving the sort with the copy phase would also be
> > desirable...
> >
> > But I'm all for clearly IDing reduces vs shuffle.
> I think this is mostly a terminology problem.
> There is a 1:1 correspondence between shuffle tasks and reduce tasks,
> and a strict ordered dependency.  There's no advantage in trying to
> separate their implementations: we need to start a thread to manage
> first a shuffle and then, immediately after, if the shuffle succeeds, a
> reduce.  So this may as well be the same thread.
> So I don't think we need a ShuffleTask class, separately scheduled by
> the TaskTracker, but, rather, we just need to start calling the first
> part of the reduce task progress "shuffle".  Thus the fix is only to
> progress reporting code.
> Doug
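
for reference, the progress-reporting change described in the quoted
message could look roughly like the sketch below. this is not the actual
Hadoop reporting code; the class, enum, and method names are
hypothetical. one task object carries a phase label, and the status
string names the phase the task is currently in:

```java
import java.util.Locale;

// Hypothetical sketch, not Hadoop code: one task, two named phases.
class ReduceTaskStatus {
    enum Phase { SHUFFLE, REDUCE }

    private Phase phase = Phase.SHUFFLE; // every reduce task starts by shuffling
    private float phaseProgress = 0f;    // fraction of the current phase, 0..1

    // Called once the shuffle has succeeded and the reduce proper begins.
    void startReduce() {
        phase = Phase.REDUCE;
        phaseProgress = 0f;
    }

    // Records progress within the current phase, clamped to [0, 1].
    void setProgress(float fraction) {
        phaseProgress = Math.max(0f, Math.min(1f, fraction));
    }

    // Status line that names the phase instead of always saying "reduce".
    String report() {
        return String.format(Locale.ROOT, "%s %.0f%%",
                phase.name().toLowerCase(Locale.ROOT), phaseProgress * 100);
    }
}
```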
