hadoop-mapreduce-issues mailing list archives

From "Mariappan Asokan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2454) Allow external sorter plugin for MR
Date Sun, 18 Nov 2012 12:46:58 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499789#comment-13499789 ]

Mariappan Asokan commented on MAPREDUCE-2454:
---------------------------------------------

Hi Arun,
  I would like to make the following points:

* We talked about the different kinds of processing that can happen before the {{Reducer}}.
Currently, we have a *merge*.  It can be a *sort*, as you mentioned, or a simple *copy* as well.
The *copy* case arises when one wants to avoid the sorting that happens in the MR data flow.
It would enable hash-based aggregation or join in the {{Reducer}}.
* Regardless of the processing done, or whether shuffle is push or pull based, the processing
step, not the shuffle, should be in control of driving the flow of data.  This is not obvious
for a *sort* or *merge*.  For a *copy*, it makes a big difference.
* For a *copy*, we want the {{Reducer}} to receive the <key, value> pairs as soon as
data is shuffled (unlike a *sort* or *merge*, which has to wait until the last <key, value>
pair is seen before outputting the first <key, value> pair).  There is no need to spill
data to disk on the reduce side.
* With the current arrangement, where shuffle assumes that the processing (*merge*) can return
a {{RawKeyValueIterator}} only at the end of shuffling, it is impossible to support *copy*.
There is an inherent deadlock because *copy* wants to return the <key, value> pairs right
away, whereas shuffle assumes that this can happen only at the end.
* The change I made is very simple.  It does not alter any semantics, and it allows the processing
to be a *copy* without any deadlock.  In fact, the test I created as part of this Jira does
a simple *copy* before the {{Reducer}}.
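
To illustrate the difference between *copy* and *merge* described in the points above, here
is a minimal Java sketch.  It is not the patch attached to this Jira and it does not use any
Hadoop classes.  A plain {{BlockingQueue}} stands in for the shuffle output, and the class
name, the method names and the {{END}} marker are all hypothetical names chosen only for
this illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustration only: none of these names are Hadoop APIs.
class CopyVersusMergeSketch {
    // Hypothetical marker that the simulated shuffle enqueues after its last pair.
    static final String[] END = new String[0];

    // copy: hand each <key, value> pair to the caller as soon as it arrives.
    static Iterator<String[]> copy(final BlockingQueue<String[]> shuffled) {
        return new Iterator<String[]>() {
            private String[] pending;
            private boolean finished;

            @Override
            public boolean hasNext() {
                if (finished) return false;
                if (pending == null) {
                    try {
                        // Waits only for the next pair, not for the end of shuffle.
                        pending = shuffled.take();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        finished = true;
                        return false;
                    }
                    if (pending == END) {
                        finished = true;
                        pending = null;
                        return false;
                    }
                }
                return true;
            }

            @Override
            public String[] next() {
                if (!hasNext()) throw new NoSuchElementException();
                String[] pair = pending;
                pending = null;
                return pair;
            }
        };
    }

    // merge/sort: the first pair cannot be returned until the last one is seen.
    static Iterator<String[]> merge(BlockingQueue<String[]> shuffled) {
        List<String[]> all = new ArrayList<>();
        Iterator<String[]> in = copy(shuffled);
        while (in.hasNext()) all.add(in.next());
        all.sort((a, b) -> a[0].compareTo(b[0]));
        return all.iterator();
    }

    // Concatenates the keys of the remaining pairs, for display.
    static String keys(Iterator<String[]> it) {
        StringBuilder sb = new StringBuilder();
        while (it.hasNext()) sb.append(it.next()[0]);
        return sb.toString();
    }

    public static void main(String[] args) {
        BlockingQueue<String[]> q = new LinkedBlockingQueue<>();
        q.add(new String[] {"b", "2"});
        Iterator<String[]> streamed = copy(q);
        // The first pair is consumed while the simulated shuffle is still open.
        System.out.println("copy, first key: " + streamed.next()[0]);
        q.add(new String[] {"a", "1"});
        q.add(new String[] {"c", "3"});
        q.add(END);
        System.out.println("copy, remaining keys: " + keys(streamed));

        BlockingQueue<String[]> q2 = new LinkedBlockingQueue<>();
        q2.add(new String[] {"b", "2"});
        q2.add(new String[] {"a", "1"});
        q2.add(new String[] {"c", "3"});
        q2.add(END);
        System.out.println("merge, keys: " + keys(merge(q2)));
    }
}
```

In this sketch, calling {{merge}} on the first queue before {{END}} had been enqueued would
block forever in a single thread.  That is a small-scale version of the deadlock described
above: the consumer waits for the end of shuffle, while the caller expects pairs right away.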

I hope this clarifies the reason for the change.

Thanks.

-- Asokan

                
> Allow external sorter plugin for MR
> -----------------------------------
>
>                 Key: MAPREDUCE-2454
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2454
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 2.0.0-alpha, 3.0.0, 2.0.2-alpha
>            Reporter: Mariappan Asokan
>            Assignee: Mariappan Asokan
>            Priority: Minor
>              Labels: features, performance, plugin, sort
>         Attachments: HadoopSortPlugin.pdf, HadoopSortPlugin.pdf, KeyValueIterator.java,
MapOutputSorterAbstract.java, MapOutputSorter.java, mapreduce-2454-modified-code.patch, mapreduce-2454-modified-test.patch,
mapreduce-2454-new-test.patch, mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch,
mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454.patch, mapreduce-2454-protection-change.patch,
mr-2454-on-mr-279-build82.patch.gz, MR-2454-trunkPatchPreview.gz, ReduceInputSorter.java
>
>
> Define interfaces and some abstract classes in the Hadoop framework to facilitate external
sorter plugins both on the Map and Reduce sides.

