hadoop-mapreduce-issues mailing list archives

From "Mariappan Asokan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2454) Allow external sorter plugin for MR
Date Wed, 27 Apr 2011 15:50:03 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025855#comment-13025855 ]

Mariappan Asokan commented on MAPREDUCE-2454:
---------------------------------------------

Hi Owen,
  Thank you very much for your suggestion.  Originally, I was experimenting with
my code on a Cloudera distribution which is based on Apache Hadoop 0.20.2.  I
added most of my code to the mapred package.  We did some extensive testing with
an external sorter plugin and found the results very encouraging.

It is really exciting to see where Hadoop is heading in the long term.  The
contribution we are making will remain useful even when all the Task-related
classes are public and live outside the core packages.

I am giving more details on the proposal below.  Please feel free to comment.

The idea is to bypass the framework's sorting on both the Map and Reduce sides.
On the Map side, this is easy: just define a public interface that extends
MapOutputCollector.  Please see the attached file MapOutputSorter.java.

An abstract class called MapOutputSorterAbstract (implementing MapOutputSorter)
will be provided, which acts as a conduit for invoking methods in
package-private classes in the mapred package.  I expect that once Hadoop
evolves and pulls the Task-related classes out of the core package, this
abstract class may become unnecessary, but it is harmless.  The abstract class
exposes methods to send progress messages, to get a Counter object, to run the
Combiner, to get a Map output file to write to, and to get a Map index file to
write to.  These methods are very thin in the sense that they use simple
delegation.
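
As a rough illustration of the Map-side design described above, here is a
minimal sketch of a plugin interface extending a collector, plus an abstract
base class that only delegates.  The attached MapOutputSorter.java is not
reproduced in this message, so everything below is an assumption: the
simplified MapOutputCollector signature, the TaskInternals stand-in for the
package-private task classes, the helper method names, and the toy
InMemorySorter plugin are all hypothetical.  Only progress reporting and a
counter are modeled; the Combiner and the output/index file helpers are
omitted.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the framework's collector interface (assumption).
interface MapOutputCollector<K, V> {
    void collect(K key, V value, int partition) throws IOException;
    void flush() throws IOException;
    void close() throws IOException;
}

// Hypothetical public plugin interface: it simply extends the collector.
interface MapOutputSorter<K, V> extends MapOutputCollector<K, V> {
}

// Hypothetical stand-in for package-private task internals.
class TaskInternals {
    private long recordCounter;
    private int progressCalls;
    void progress() { progressCalls++; }
    void incrementCounter(long n) { recordCounter += n; }
    long counterValue() { return recordCounter; }
    int progressCalls() { return progressCalls; }
}

// Thin conduit: each helper is a one-line delegation to the internals.
abstract class MapOutputSorterAbstract<K, V> implements MapOutputSorter<K, V> {
    private final TaskInternals task;
    protected MapOutputSorterAbstract(TaskInternals task) { this.task = task; }
    protected void reportProgress() { task.progress(); }
    protected void incrementRecordCounter(long n) { task.incrementCounter(n); }
}

// Toy "external" sorter: buffers records, then sorts by (partition, key).
class InMemorySorter<K extends Comparable<K>, V>
        extends MapOutputSorterAbstract<K, V> {
    private static final class Record<K, V> {
        final int partition; final K key; final V value;
        Record(int partition, K key, V value) {
            this.partition = partition; this.key = key; this.value = value;
        }
    }

    private final List<Record<K, V>> buffer = new ArrayList<>();

    InMemorySorter(TaskInternals task) { super(task); }

    @Override
    public void collect(K key, V value, int partition) {
        buffer.add(new Record<>(partition, key, value));
        incrementRecordCounter(1);
    }

    @Override
    public void flush() {
        buffer.sort((a, b) -> {
            int c = Integer.compare(a.partition, b.partition);
            return c != 0 ? c : a.key.compareTo(b.key);
        });
        reportProgress();
    }

    @Override
    public void close() { buffer.clear(); }

    // Keys in their current buffer order (sorted once flush() has run).
    List<K> sortedKeys() {
        List<K> keys = new ArrayList<>();
        for (Record<K, V> r : buffer) keys.add(r.key);
        return keys;
    }
}
```

A real plugin would hand the records to an external sort engine and write the
Map output and index files instead of sorting an in-memory list.  The point of
the sketch is only that the plugin sees the collector methods plus a few
delegating helpers, and never touches the task internals directly.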

On the Reduce side, defining an external sorter interface is a bit tricky.
Please refer to the attached file ReduceInputSorter.java for details.

Again, there will be an abstract base class, ReduceInputSorterAbstract
(implementing ReduceInputSorter), which users can extend to implement the
external sorter in the reduce phase.  This abstract class provides methods to
send progress messages, to get Counter objects, and to update shuffle client
metrics.  I had to modify MapTask.java, ReduceTask.java, Shuffle.java,
Fetcher.java, and MapOutput.java to accommodate the external sorter.
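
To make the Reduce-side idea more concrete, here is a small sketch of what a
plugin built on such a base class could look like.  The attached
ReduceInputSorter.java is not shown in this message, so the interface methods
(addFetchedRun, mergedKeys), the ShuffleStats stand-in for the shuffle client
metrics, and the MergingSorter plugin are all hypothetical.  The sketch
assumes each fetched Map output arrives as an already sorted run of keys and
merges the runs with a heap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical plugin interface; the real attachment may look different.
interface ReduceInputSorter<K> {
    void addFetchedRun(List<K> sortedRun); // called as each Map output arrives
    Iterator<K> mergedKeys();              // called once fetching is complete
}

// Hypothetical stand-in for the shuffle client metrics.
class ShuffleStats {
    private int runsFetched;
    void runFetched() { runsFetched++; }
    int runsFetched() { return runsFetched; }
}

// Thin base class: the helper only delegates to the metrics object.
abstract class ReduceInputSorterAbstract<K> implements ReduceInputSorter<K> {
    private final ShuffleStats stats;
    protected ReduceInputSorterAbstract(ShuffleStats stats) { this.stats = stats; }
    protected void updateShuffleMetrics() { stats.runFetched(); }
}

// Toy plugin: k-way merge of the sorted runs using a priority queue.
class MergingSorter<K extends Comparable<K>> extends ReduceInputSorterAbstract<K> {
    // Tracks the current head element of one run.
    private static final class Cursor<K> {
        final Iterator<K> it;
        K head;
        Cursor(Iterator<K> it) { this.it = it; this.head = it.next(); }
        boolean advance() {
            if (!it.hasNext()) return false;
            head = it.next();
            return true;
        }
    }

    private final List<List<K>> runs = new ArrayList<>();

    MergingSorter(ShuffleStats stats) { super(stats); }

    @Override
    public void addFetchedRun(List<K> sortedRun) {
        runs.add(sortedRun);
        updateShuffleMetrics();
    }

    @Override
    public Iterator<K> mergedKeys() {
        Comparator<Cursor<K>> byHead = (a, b) -> a.head.compareTo(b.head);
        PriorityQueue<Cursor<K>> heap = new PriorityQueue<>(11, byHead);
        for (List<K> run : runs) {
            Iterator<K> it = run.iterator();
            if (it.hasNext()) heap.add(new Cursor<>(it)); // skip empty runs
        }
        List<K> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor<K> smallest = heap.poll();
            merged.add(smallest.head);
            if (smallest.advance()) heap.add(smallest); // re-insert with new head
        }
        return merged.iterator();
    }
}
```

In this sketch the framework side would call addFetchedRun from the fetch path
and mergedKeys when the reduce starts.  How the actual patch wires the sorter
into Shuffle.java and Fetcher.java is not described here, so that wiring is
left out.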

If I can work with an Apache committer, I will be more than happy to discuss the
details of all the code changes.  When I moved my code from the Cloudera
distribution to Apache 0.21.0, I noticed some refactoring in ReduceTask.java
(for the better).  I am still merging the changes, and it may take a week or two
to test them in-house before the code can be tested formally and submitted for
review.  As far as packaging is concerned, I will try to define most of the
classes in the mapreduce package rather than the mapred package (which is what I
used in the Cloudera distribution).

I would appreciate early feedback on this from everyone.

Thank you very much.

-- Asokan


> Allow external sorter plugin for MR
> -----------------------------------
>
>                 Key: MAPREDUCE-2454
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2454
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Mariappan Asokan
>            Priority: Minor
>         Attachments: MapOutputSorter.java, ReduceInputSorter.java
>
>
> Define interfaces and some abstract classes in the Hadoop framework to facilitate external
> sorter plugins both on the Map and Reduce sides.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
