hadoop-mapreduce-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1220) Implement an in-cluster LocalJobRunner
Date Thu, 19 Nov 2009 03:46:39 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779816#action_12779816 ]

Aaron Kimball commented on MAPREDUCE-1220:
------------------------------------------

I'm not really sure I see the utility of such a big change. Would this really perform that much
better than running locally on the client? The client needs access to HDFS anyway in order to
do things like create the InputSplits for the cluster, so what's the advantage of running a
single-threaded process on the cluster? You'd still need to do some of the more heavyweight
job-setup operations: ship the client jar over, spawn a separate JVM (even if it's reused for
all map/reduce tasks), set up the task IPC connection to the tasktracker, etc. You'd also be
exposed to the large latency penalty inherent in the "tasktracker polls" model of task
scheduling that Hadoop uses.

If the job is really so small that it makes sense to run it in a single thread, then I doubt
that the overhead described above would be outweighed by the benefit of running in the
quasi-locality of the cluster, vs. just staying in the client and starting immediately.

If you do need this behavior, though, then rather than build a significant amount of new internal
architecture, it occurs to me that you could probably do this all with "user level" code (maybe
something that goes in the o.a.h.mapreduce.lib package) as follows: Write a map-only job that
uses something like NLineInputFormat to create a single map task. That single map task could
then act as a springboard to set up the real job (maybe you've pre-serialized the
jobconf.xml in the client and sent it to the singleton map task via the distributed cache)
and run it in the existing LocalJobRunner there. I think this approach would be a lot cleaner;
thoughts?


> Implement an in-cluster LocalJobRunner
> --------------------------------------
>
>                 Key: MAPREDUCE-1220
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1220
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: client, jobtracker
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>             Fix For: 0.22.0
>
>
> Currently very small map-reduce jobs suffer from latency issues due to overheads in Hadoop
> Map-Reduce such as scheduling, JVM startup etc. We've periodically tried to optimize all parts
> of the framework to achieve lower latencies.
> I'd like to turn the problem around a little bit. I propose we allow very small jobs
> to run as a single-task job with multiple maps and reduces, i.e. similar to our current implementation
> of the LocalJobRunner. Thus, under certain conditions (maybe user-set configuration, or if
> input data is small, i.e. less than a DFS blocksize) we could launch a special task which will run
> all maps in a serial manner, followed by the reduces. This would really help small jobs achieve
> significantly lower latencies, thanks to less scheduling overhead, JVM startup, lack of
> shuffle over the network etc.
> This would be a huge benefit, especially on large clusters, to small Hive/Pig queries.
> Thoughts?
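
For readers unfamiliar with the idea, the "run all maps in a serial manner, followed by
the reduces" model from the issue description can be illustrated with a toy, in-memory
sketch. This is not Hadoop code and not the proposed implementation: it uses no Hadoop
classes, the class and method names (SerialJobSketch, mapSplit, runSerially) are
hypothetical, and word counting is just an assumed example workload. It only shows why a
single task needs no scheduling round-trips or network shuffle between the phases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: no Hadoop classes are used and every name here is hypothetical.
class SerialJobSketch {

    // "Map" phase for one input split: emit each whitespace-separated token as a key.
    static List<String> mapSplit(String split) {
        List<String> emitted = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                emitted.add(token);
            }
        }
        return emitted;
    }

    // Run every map one after another in the calling thread, then the reduce.
    // Intermediate output stays in memory, so nothing is shuffled over a network.
    static Map<String, Integer> runSerially(List<String> splits) {
        List<String> intermediate = new ArrayList<>();
        for (String split : splits) {
            intermediate.addAll(mapSplit(split));
        }
        // "Reduce" phase: sum the occurrences per key. A TreeMap keeps keys sorted,
        // loosely standing in for the sort that precedes a real reduce.
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : intermediate) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("a b a", "b c");
        System.out.println(runSerially(splits));
    }
}
```

Whether such a task runs on the client or inside the cluster is exactly the point under
discussion in this thread; the sketch is the same either way.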

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

