hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4808) Allow reduce-side merge to be pluggable
Date Sun, 13 Jan 2013 01:21:12 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552113#comment-13552113
] 

Chris Douglas commented on MAPREDUCE-4808:
------------------------------------------

I like many of these changes anyway- the refactoring from tagged unions to a type hierarchy
for {{MapOutput}} is nice- though I'm still not groking why the new API is necessary. The
design doc reads like a summary of the patches and a motivation for pluggability generally,
without directly motivating this particular moving part. [~masokan], is there code from MAPREDUCE-2454
reviewers should look at to better understand why this is required?

While much of the shuffle code can be reused between implementations- as Jerry points out-
a common base for all implementations would be overly-ambitious. To rephrase what Arun has
been saying: adding a "plugin" for tweaking the existing code is not necessary when the whole
{{Shuffle}} can be swapped out after MAPREDUCE-4049.

The only exception I can think of is for {{LocalRunner}}, where the {{Merger}}- not the {{MergeManager}}-
needs to be pluggable. The current patch introduces a dependency on the {{MergeManager}} for
the local case, using only the subset that applies in that context. Without a use case (e.g.,
useful for testing), it's hard to justify the patch's abstraction if all implementations of
{{MergeManager}} are exposed to this special invocation. To be fair, it may be required for
the "ubertask" code to work seamlessly, but that case needs to be made. Even so: the {{Merger}}
seems like the right cut point.

To be concrete, take the RDMA/network merge case as an example. Assume the overhead of a reader
requires one to limit the number of clients/open connections in the reduce. Implementing a
{{Merger}} that keeps a total order on its segments and only keeps the top _k_ remote segments
open would loosely track that resource. But since the {{Fetcher}} is essentially just packaging
the remote segment and translating the TCE into a {{MapOutput}} (i.e., doesn't open a connection),
it's not obvious why a {{MergeManager}} would coordinate resource use between them. Even with
a hybrid scheme- where a set of {{Fetcher}} and {{Merger}} instances pull segments locally
based on some cost function, so they *do* share resources- aren't the changes extensive enough
to justify a separate {{Shuffle}} implementation?

The Merger *does* handle multiple merge passes; intermediate merges of segments are part of
its contract. While you may be referring to _continuous_ merging of segments arriving concurrently
with a merge, again: why would it be preferable to configure a {{MergeManager}} instead of
a wholly new {{Shuffle}} impl?
                
> Allow reduce-side merge to be pluggable
> ---------------------------------------
>
>                 Key: MAPREDUCE-4808
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4808
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 2.0.2-alpha
>            Reporter: Arun C Murthy
>            Assignee: Mariappan Asokan
>             Fix For: 2.0.3-alpha
>
>         Attachments: COMBO-mapreduce-4809-4812-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch,
mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, MergeManagerPlugin.pdf
>
>
> Allow reduce-side merge to be pluggable for MAPREDUCE-2454

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message