hadoop-mapreduce-issues mailing list archives

From "Greg Roelofs (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1220) Implement an in-cluster LocalJobRunner
Date Wed, 15 Sep 2010 04:29:38 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909583#action_12909583 ]

Greg Roelofs commented on MAPREDUCE-1220:

I've been looking at what it will take to extend this from Arun's February sketch to something
that will actually schedule and run small jobs. At the moment it looks like it will largely
avoid JobTracker; most of the action seems to be in JobInProgress (particularly {{initTasks()}}
and {{obtainNew*Task()}}), TaskInProgress ({{findNew*Task()}}, {{getTaskToRun()}}, {{addRunningTask()}}),
and the scheduler ({{assignTasks()}}).

I haven't asked Arun what he originally had in mind, but it seems that there are two fairly
obvious approaches:

 * treat UberTask (or MetaTask or MultiTask or AllInOneTask or ...) as a third kind of Task,
not exactly like either MapTask or ReduceTask but a peer (more or less) to both
 * treat UberTask as a variant (subclass) of MapTask

I see the first approach as conceptually cleaner, and some of its implementation details would
be cleaner as well, but overall it's harder to implement: there are lots of places where there's
map-vs-reduce logic (occasionally with special setup/cleanup cases), and most of it would
require modifications. The second approach seems slightly hackish, but it has at least that
one big advantage: many of the map-vs-reduce bits of code would not require changes - in particular,
I don't believe it would be necessary to touch the schedulers (and we're up to, what, four
at this point?) since they'd simply see a job containing one map and no reduces.
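
To make the second approach concrete, here is a minimal, self-contained sketch. The {{Task}} interface and the class shape below are stand-ins for illustration, not the actual Hadoop classes (the real MapTask/ReduceTask interfaces are far richer): the scheduler only ever sees the outer task, while internally it runs every sub-map serially and then the lone reduce.

```java
import java.util.List;

// Stand-in for Hadoop's Task hierarchy; illustrative names only.
interface Task {
    void run();
}

// The "variant of MapTask" idea: to the scheduler this looks like a
// single map task, but internally it runs all sub-maps serially,
// followed by the (at most one) reduce.
class UberTask implements Task {
    private final List<Task> subMaps;
    private final Task subReduce; // may be null: a map-only job

    UberTask(List<Task> subMaps, Task subReduce) {
        this.subMaps = subMaps;
        this.subReduce = subReduce;
    }

    @Override
    public void run() {
        for (Task m : subMaps) {
            m.run(); // serial map phase, no per-task JVM startup
        }
        if (subReduce != null) {
            subReduce.run(); // in-process "shuffle", no network hop
        }
    }
}
```

A scheduler that only ever sees the outer task needs no changes at all, which is the one big advantage of the subclass approach noted above.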

Thoughts? (I don't yet have a strong opinion myself, though I'm interested in getting a proof
of concept running quickly so I/we can see what works and whether it's a net win in the first
place; from that perspective, the latter approach may be better.)

Either way, there will be changes needed in the UI, metrics, and other accounting to surface
the tasks-within-a-task details, as well as the obvious configuration-related changes. Retry
logic, speculation, etc., are still unclear (to me, anyway).

Stab at a couple of the other upstream questions:

bq. Will users be able to set a number of reduce tasks greater than one? Or are you limited
to at most one reduce task?

The latter, at least for now. This is somewhat related to the question of serial execution;
i.e., if it's serial, the only reason you would want multiple reduces is if a single one is
too big to fit in memory, and if that's the case, it's arguably not a "small job" anymore.
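
As a rough sketch of what the "small job" gating might look like (the method name, the max-maps limit, and the use of DFS block size as the input cutoff are all hypothetical placeholders from this discussion, not settled configuration keys):

```java
// Hypothetical uber-job eligibility check. Threshold names and values
// are placeholders, not actual Hadoop configuration.
class UberEligibility {
    static final int MAX_UBER_MAPS = 10; // placeholder limit

    static boolean isUberCandidate(int numMaps, int numReduces,
                                   long inputBytes, long dfsBlockSize) {
        return numMaps <= MAX_UBER_MAPS
            && numReduces <= 1            // at most one reduce, per above
            && inputBytes < dfsBlockSize; // input under one DFS block
    }
}
```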

bq. i am still a little dubious about whether this is ambitious enough. particularly - why
serial execution? to me local execution == exploiting resources available in a single box.
which are 8 core today and on the way up.

Well, generally multiple cores => multiple slots, at which point you let the scheduler
figure it out. Chris mentioned that there's a JIRA to extend the LocalJobRunner in this direction
(MAPREDUCE-434?), and if the currently proposed version of _this_ one works out, an obvious
next step would be to look at similar extensions. But this is still an untested optimization,
and all the usual caveats about premature optimization apply.

I don't know how Arun looked at the problem - I'm guessing the motivation was empirical, based
on the behavior of Oozie workflows and Pig jobs - but I view it as a way to make task granularity
a bit more homogeneous by packing multiple, too-small tasks into larger containers. My mental
model is that that will tend to make the scheduler more efficient - but also that trying to
do too much may begin to work at cross-purposes with the scheduler, i.e., one probably doesn't
want to create a secondary scheduler (trying to tune both halves could get very messy).

bq. it does seem that a dispatcher facility at the lowest layer that can juggle between different
hadoop clusters is useful generically (across Hive/Pig/native-Hadoop etc.) - but that's not
quite the same as what's proposed here - is it?

Nope. But I think other groups are looking at stuff like that.

> Implement an in-cluster LocalJobRunner
> --------------------------------------
>                 Key: MAPREDUCE-1220
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1220
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: client, jobtracker
>            Reporter: Arun C Murthy
>            Assignee: Greg Roelofs
>             Fix For: 0.22.0
>         Attachments: MAPREDUCE-1220_yhadoop20.patch, MR-1220.v2.trunk-hadoop-mapreduce.patch.txt,
> Currently very small map-reduce jobs suffer from latency issues due to overheads in Hadoop
> Map-Reduce such as scheduling, jvm startup etc. We've periodically tried to optimize all parts
> of the framework to achieve lower latencies.
> I'd like to turn the problem around a little bit. I propose we allow very small jobs
> to run as a single-task job with multiple maps and reduces, i.e. similar to our current implementation
> of the LocalJobRunner. Thus, under certain conditions (maybe user-set configuration, or if
> input data is small, i.e. less than a DFS blocksize) we could launch a special task which will run
> all maps in a serial manner, followed by the reduces. This would really help small jobs achieve
> significantly smaller latencies, thanks to less scheduling overhead, jvm startup, lack of
> shuffle over the network etc.
> This would be a huge benefit, especially on large clusters, to small Hive/Pig queries.
> Thoughts?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
