hadoop-mapreduce-issues mailing list archives

From "Mariappan Asokan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2454) Allow external sorter plugin for MR
Date Thu, 05 May 2011 21:53:03 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029596#comment-13029596

Mariappan Asokan commented on MAPREDUCE-2454:

Hi Owen,
  Thanks for your comments.  I like your suggestions on the signature of the initialize() method
and on not having a flush().  However, I prefer to pass the Key and Value as objects instead
of as a serialized ByteArray, for the following reasons:
* It is easier and more efficient when an external program (like the UNIX sort command) is invoked
as the sorter.  The Key and Value types will be Text.  The bytes in the Text can be grabbed
and passed to the program with a TAB between them.  There is no need to deserialize data passed
in a ByteArray.  This is similar to what happens with Hadoop streaming when, for example,
a Mapper is implemented by an external program.  Also, on the Map side the output of the mapper
is key and value objects, which can be passed directly to the sorter.  Thus there is no need
for extra serialization/deserialization.  A similar argument applies when the output of the sorter
is read on the Reduce side using RecordReader.
* The framework's serialization is in no way affected.  The framework is free to replace the
serialization layer.  The external sorter can store the sorted output as simple UNIX text records
in the final map output file, since the sorter itself will deal with the shuffled data on the Reduce side.
* For the RecordReader, I think it is better to change the signature of getKey() and getValue()
as below:
Object getKey(Object key) // If key is null, it will be allocated first.
Object getValue(Object value) // If value is null, it will be allocated first.
The reasons for these signatures are:
   ** The RecordReader will be used for running the Combiner and the Reducer.  This may involve saving
the last seen key.  If the caller passes in the key object, it owns that object and can just keep
the object handle instead of copying the entire object.  If the callee returns its own object,
that object is ephemeral, so the caller has to copy it in order to save it.
   ** Creating an adapter that returns key and value objects from their serialized counterparts (that
is, from RawKeyValueIterator) will not result in any extra data copying.  So the performance
of the framework's sorter will not degrade.
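
To illustrate the caller-owned key idea, here is a minimal sketch in plain Java.  It is not code
from the attached patch: the SortedRecordReader interface, its next() method, the InMemoryReader
class and the groupByKey() helper are all hypothetical names made up for this example, and
StringBuilder only stands in for a mutable type such as Text.  The point is that the caller keeps
the last seen key by swapping two object handles rather than copying the key contents.

```java
import java.util.ArrayList;
import java.util.List;

public class CallerOwnedKeyDemo {
    /** Hypothetical reader interface, modeled on the getKey()/getValue() signatures proposed above. */
    interface SortedRecordReader {
        boolean next();                 // assumed: advance to the next record, false at end
        Object getKey(Object key);      // fills key; allocates it first if key is null
        Object getValue(Object value);  // fills value; allocates it first if value is null
    }

    /** In-memory reader over already sorted pairs. StringBuilder stands in for Text. */
    static final class InMemoryReader implements SortedRecordReader {
        private final String[][] records;
        private int pos = -1;

        InMemoryReader(String[][] records) { this.records = records; }

        public boolean next() { return ++pos < records.length; }
        public Object getKey(Object key) { return fill(key, records[pos][0]); }
        public Object getValue(Object value) { return fill(value, records[pos][1]); }

        private static Object fill(Object target, String data) {
            StringBuilder sb = (target == null) ? new StringBuilder() : (StringBuilder) target;
            sb.setLength(0);   // reuse the caller's object instead of allocating a new one
            sb.append(data);
            return sb;
        }
    }

    /** Groups values by key, the way a driver for a Reducer or Combiner might. */
    static List<String> groupByKey(SortedRecordReader reader) {
        List<String> groups = new ArrayList<>();
        Object current = null;   // key object the reader fills on the next call
        Object last = null;      // last seen key, kept by handle only
        Object value = null;
        StringBuilder group = null;
        while (reader.next()) {
            current = reader.getKey(current);
            value = reader.getValue(value);
            if (last == null || !current.toString().equals(last.toString())) {
                if (group != null) {
                    groups.add(group.toString());
                }
                group = new StringBuilder(current.toString()).append('=').append(value);
                // The caller owns both key objects, so saving the key is a handle swap.
                Object tmp = last;
                last = current;
                current = tmp;
            } else {
                group.append(',').append(value);
            }
        }
        if (group != null) {
            groups.add(group.toString());
        }
        return groups;
    }

    public static void main(String[] args) {
        String[][] sorted = { {"a", "1"}, {"a", "2"}, {"b", "3"} };
        System.out.println(groupByKey(new InMemoryReader(sorted)));
    }
}
```

With this pattern at most two key objects are allocated for the whole run, however many records
are read.  If getKey() instead returned an object owned by the reader, the loop would have to copy
the key contents every time the key changes.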

Owen, do you have any suggestion on a committer with whom I can work on this?

> Allow external sorter plugin for MR
> -----------------------------------
>                 Key: MAPREDUCE-2454
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2454
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Mariappan Asokan
>            Priority: Minor
>         Attachments: KeyValueIterator.java, MapOutputSorter.java, MapOutputSorterAbstract.java,
> Define interfaces and some abstract classes in the Hadoop framework to facilitate external
sorter plugins both on the Map and Reduce sides.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
