[ https://issues.apache.org/jira/browse/MAPREDUCE-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004001#comment-13004001
]
Greg Roelofs commented on MAPREDUCE-1220:
-----------------------------------------
If you've been watching the commits go by on Owen's yahoo-merge branch, about five months'
worth of work on this was included. Unfortunately, I screwed up my most recent push to Yahoo's
internal git repo, and as a result, every internal/temporary/debug commit was exposed, which
amounts to a lot of noise (> 3 dozen extra commits).
I'll post the ~8 "real" patches corresponding to all of that here, along with a dump() method
for Progress (in common), which was useful for debugging and may be needed again. There are
also a few screenshots, but I'll probably need to scrub some internal hostnames before posting
those.
> Implement an in-cluster LocalJobRunner
> --------------------------------------
>
> Key: MAPREDUCE-1220
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1220
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: client, jobtracker
> Reporter: Arun C Murthy
> Assignee: Greg Roelofs
> Attachments: MAPREDUCE-1220_yhadoop20.patch, MR-1220.v2.trunk-hadoop-mapreduce.patch.txt,
MR-1220.v2.trunk-hadoop-mapreduce.patch.txt
>
>
> Currently very small map-reduce jobs suffer from latency issues due to overheads in Hadoop
Map-Reduce such as scheduling, JVM startup, etc. We've periodically tried to optimize all parts
of the framework to achieve lower latencies.
> I'd like to turn the problem around a little bit. I propose we allow very small jobs
to run as a single-task job with multiple maps and reduces, i.e. similar to our current implementation
of the LocalJobRunner. Thus, under certain conditions (maybe user-set configuration, or if
input data is small, i.e. less than a DFS blocksize) we could launch a special task which will run
all maps in a serial manner, followed by the reduces. This would really help small jobs achieve
significantly smaller latencies, thanks to less scheduling overhead and JVM startup, no
shuffle over the network, etc.
> This would be a huge benefit, especially on large clusters, to small Hive/Pig queries.
> Thoughts?
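
The proposal quoted above can be sketched roughly as follows. This is only an illustration of the idea, not code from any of the attached patches: the class name, method names, and the exact eligibility rule (user opt-in OR input smaller than one DFS block) are assumptions made for the example, and no real Hadoop APIs are used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: every name here is invented for illustration.
class SmallJobSketch {

    // One possible reading of "under certain conditions": the user asked for it,
    // or the total input is smaller than a single DFS block.
    static boolean shouldRunInSingleTask(boolean userRequested,
                                         long inputBytes,
                                         long dfsBlockSizeBytes) {
        return userRequested || inputBytes < dfsBlockSizeBytes;
    }

    // Runs all maps one after another, then all reduces, inside the calling
    // thread (standing in for the single "special task"). Returns the order
    // in which the sub-tasks ran.
    static List<String> runSerially(List<Runnable> maps, List<Runnable> reduces) {
        List<String> order = new ArrayList<>();
        for (int i = 0; i < maps.size(); i++) {
            maps.get(i).run();
            order.add("map-" + i);
        }
        for (int i = 0; i < reduces.size(); i++) {
            reduces.get(i).run();
            order.add("reduce-" + i);
        }
        return order;
    }
}
```

Because everything runs in one task, there is a single scheduling decision and a single JVM, and map output never has to cross the network to reach the reduces.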
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira