hadoop-mapreduce-issues mailing list archives

From "Mariappan Asokan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4808) Refactor MapOutput and MergeManager to facilitate reuse by Shuffle implementations
Date Thu, 17 Jan 2013 23:04:15 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556705#comment-13556705
] 

Mariappan Asokan commented on MAPREDUCE-4808:
---------------------------------------------

Hi Arun,
  I will try to explain a simple use case of an external implementation of merge on the reduce
side.  Let us say this merge implementation has some fixed area of memory (Java byte array)
allocated to store the shuffled data.  This may be done to avoid frequent garbage collection
by the JVM or to improve processor cache efficiency.

Looking at the methods in the {{Merge}} class, they accept input to the merge either as disk
files (an array of {{Path}} objects) or as memory segments (a list of {{Segment}} objects).  The
former is not suitable since the merge is done in memory first and any intermediate merged output
file is under the control of the plugin implementation.  The latter is not suitable because, with
{{Segment}} objects, the memory for the shuffled data is not under the control of the plugin
implementation.

Ideally, if an {{InputStream}} object is available, the external implementation can read shuffled
data from the stream to the fixed area of memory at a specific offset in the byte array.
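
To make this concrete, here is a minimal sketch of reading one map output from a stream into a
preallocated byte array at a given offset.  The class and method names ({{FixedBufferReader}},
{{readFully}}) are hypothetical and are not part of Hadoop or of the patch; only the standard
{{java.io}} calls are real.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not a Hadoop class.
final class FixedBufferReader {
    private FixedBufferReader() {}

    /**
     * Reads exactly {@code length} bytes from {@code in} into {@code arena},
     * starting at {@code offset}. A single read() call may return fewer
     * bytes than requested, so the loop continues until the range is full.
     */
    static void readFully(InputStream in, byte[] arena, int offset, int length)
            throws IOException {
        if (offset < 0 || length < 0 || length > arena.length - offset) {
            throw new IndexOutOfBoundsException(
                "offset=" + offset + ", length=" + length
                + ", arena size=" + arena.length);
        }
        int done = 0;
        while (done < length) {
            int n = in.read(arena, offset + done, length - done);
            if (n < 0) {
                throw new EOFException(
                    "stream ended after " + done + " of " + length + " bytes");
            }
            done += n;
        }
    }
}
```

The data lands directly in the fixed memory area, so no per-output buffer is allocated and there
is nothing extra for the garbage collector to reclaim.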

With the {{MergeManagerPlugin}}, the external implementation will get the HTTP connection's
{{InputStream}} object via the {{shuffle()}} method of the {{MapOutput}} object.  In addition,
if the merge goes through multiple passes because the memory area is limited in size, there should
be some way for the {{Shuffle}} to wait until memory is released by a merge pass.  There is
no method in {{Merge}} for that either.
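
The waiting behaviour could be sketched as below.  This is only an illustration under my own
assumptions: {{MemoryArena}}, {{reserve}}, {{release}} and {{available}} are hypothetical names,
not the methods of the plugin interface, and the sketch tracks byte counts only rather than
offsets in the array.

```java
// Hypothetical bookkeeping for a fixed-size memory area; not a Hadoop class.
final class MemoryArena {
    private final int capacity;
    private int used;

    MemoryArena(int capacity) {
        this.capacity = capacity;
    }

    /** Blocks the calling (fetcher) thread until {@code size} bytes are free. */
    synchronized void reserve(int size) throws InterruptedException {
        if (size < 0 || size > capacity) {
            throw new IllegalArgumentException("cannot reserve " + size + " bytes");
        }
        while (capacity - used < size) {
            wait(); // woken by release() when a merge pass frees memory
        }
        used += size;
    }

    /** Called when a merge pass has finished with {@code size} bytes. */
    synchronized void release(int size) {
        if (size < 0 || size > used) {
            throw new IllegalStateException("cannot release " + size + " bytes");
        }
        used -= size;
        notifyAll(); // let any waiting fetcher re-check the free space
    }

    synchronized int available() {
        return capacity - used;
    }
}
```

A fetcher thread would call {{reserve()}} before reading a map output from the stream, and the
merge thread would call {{release()}} after each pass.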

I find that it is possible to define the interaction points between the current {{Shuffle}} and
{{MergeManager}} using the {{MergeManagerPlugin}} interface.  The plugin interface has only
three methods, and it gives the external plugin a lot of freedom in its implementation.
As a side effect, {{MapOutput}} is also refactored.

Hope I explained this well.  If you have any questions, please let me know.

-- Asokan

                
> Refactor MapOutput and MergeManager to facilitate reuse by Shuffle implementations
> ----------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4808
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4808
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Arun C Murthy
>            Assignee: Mariappan Asokan
>         Attachments: COMBO-mapreduce-4809-4812-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch,
>                      mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch,
>                      mapreduce-4808.patch, MergeManagerPlugin.pdf
>
>
> Now that Shuffle is pluggable (MAPREDUCE-4049), it would be convenient for alternate
> implementations to be able to reuse portions of the default implementation.
> This would come with the strong caveat that these classes are LimitedPrivate and Unstable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
