hadoop-mapreduce-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1220) Implement an in-cluster LocalJobRunner
Date Thu, 19 Nov 2009 11:35:39 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779946#action_12779946 ]

Aaron Kimball commented on MAPREDUCE-1220:


My concerns center on how much tasktracker code would need to change to support a new
type of job runner. That is why I suggested a possible internal implementation model that
requires neither building a new JobRunner nor changing how child tasks interface with the
tasktracker. I understand why you want this to integrate cleanly into user-level job
scheduling and configuration. To that end, please feel free to suggest a clean client-facing
interface that sits on top of this.

For example, I think that we could add a {{Job.setSingleProcess()}} method that users call
to configure a job in this mode; it would then create another {{Job}} internally that uses
the process I described above to bootstrap into the real Job. The mechanics of managing
the subprocess are still handled by a "regular" map task that itself uses the LocalJobRunner.
Does this make sense?
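
To make the idea concrete, here is a rough, self-contained sketch of that wrapping in plain Java. It does not use any Hadoop classes: {{SketchJob}}, {{submitInto}}, and {{runLocally}} are hypothetical stand-ins invented for illustration, and {{setSingleProcess()}} is only the method proposed above, not an existing API. The "cluster" is just a list of event strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: none of these types are real Hadoop classes. */
class SingleProcessSketch {

    /** Hypothetical, heavily simplified stand-in for a Job. */
    static class SketchJob {
        final String name;
        final int numMaps;
        private boolean singleProcess;
        private Runnable mapBody; // only set on the internal bootstrap job

        SketchJob(String name, int numMaps) {
            this.name = name;
            this.numMaps = numMaps;
        }

        /** The proposed user-facing switch (assumed signature). */
        void setSingleProcess() {
            singleProcess = true;
        }

        /** Submits the job and returns a trace of what happened. */
        List<String> submit() {
            List<String> events = new ArrayList<>();
            submitInto(events);
            return events;
        }

        private void submitInto(List<String> events) {
            if (singleProcess) {
                // Wrap the real job in an ordinary job with a single map task.
                SketchJob realJob = this;
                SketchJob bootstrap = new SketchJob(name + "-bootstrap", 1);
                bootstrap.mapBody = () -> runLocally(realJob, events);
                bootstrap.submitInto(events);
            } else {
                // Stand-in for normal scheduling through the jobtracker.
                events.add("scheduled " + name + " with " + numMaps + " map task(s)");
                if (mapBody != null) {
                    mapBody.run();
                }
            }
        }
    }

    /** Stand-in for a local runner: all maps serially, then the reduce. */
    static void runLocally(SketchJob job, List<String> events) {
        for (int i = 0; i < job.numMaps; i++) {
            events.add("local map " + i + " of " + job.name);
        }
        events.add("local reduce of " + job.name);
    }

    public static void main(String[] args) {
        SketchJob job = new SketchJob("wordcount", 3);
        job.setSingleProcess();
        job.submit().forEach(System.out::println);
    }
}
```

The point of the sketch is that the scheduler only ever sees one regular single-map job; everything specific to the single-process mode lives inside that map task's body.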

> Implement an in-cluster LocalJobRunner
> --------------------------------------
>                 Key: MAPREDUCE-1220
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1220
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: client, jobtracker
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>             Fix For: 0.22.0
> Currently very small map-reduce jobs suffer from latency issues due to overheads in Hadoop
Map-Reduce such as scheduling, JVM startup, etc. We've periodically tried to optimize all parts
of the framework to achieve lower latencies.
> I'd like to turn the problem around a little bit. I propose we allow very small jobs
to run as a single-task job with multiple maps and reduces, i.e. similar to our current implementation
of the LocalJobRunner. Thus, under certain conditions (maybe user-set configuration, or if
the input data is small, i.e. less than a DFS blocksize) we could launch a special task which will run
all maps in a serial manner, followed by the reduces. This would really help small jobs achieve
significantly lower latencies, thanks to less scheduling overhead, fewer JVM startups, no
shuffle over the network, etc.
> This would be a huge benefit, especially on large clusters, to small Hive/Pig queries.
> Thoughts?
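
The eligibility condition in the quoted description (explicit user configuration, or input smaller than one DFS block) could be sketched roughly as follows. The class and method names are hypothetical, the 64 MB block size is only an assumed example value, and nothing here corresponds to an actual Hadoop configuration key.

```java
/** Illustrative only: names and values are assumptions, not Hadoop settings. */
class SmallJobPolicy {

    /**
     * Returns true if the job should run as one special task: either the
     * user asked for it, or the input is smaller than a single DFS block.
     */
    static boolean runAsSingleTask(boolean userRequested,
                                   long inputBytes,
                                   long dfsBlockSizeBytes) {
        return userRequested || inputBytes < dfsBlockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed block size for the example
        long smallInput = 10L * 1024 * 1024;
        long largeInput = 200L * 1024 * 1024;

        System.out.println(runAsSingleTask(false, smallInput, blockSize));
        System.out.println(runAsSingleTask(false, largeInput, blockSize));
        System.out.println(runAsSingleTask(true, largeInput, blockSize));
    }
}
```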

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
