hadoop-mapreduce-issues mailing list archives

From "Ravi Prakash (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3685) There are some bugs in implementation of MergeManager
Date Fri, 08 Mar 2013 23:35:14 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13597694#comment-13597694 ]

Ravi Prakash commented on MAPREDUCE-3685:

Hi Mariappan!

Thanks a lot for your review and comments. I'm sure this is not the last JIRA to go into MergeManagerImpl.
:-) If we spot something, let's definitely open a new one.

* Here's my understanding of how the merge works. Let's assume all the map outputs are on disk.
Let's also say that io.sort.factor is set to X. Since we don't want to do any more merges than
are necessary, we try to ensure that in the *final* merge there will be X streams to
merge. This means that as the reducer starts fetching map outputs, we wait until there are
at least 2X-1 map outputs (we don't know how many map outputs we will really get because some
maps may not have produced any output). Once the number reaches 2X-1, we can be sure that
we need an intermediate merge of X streams. This leaves X-1 in onDiskMapOutputs. The X streams
are merged into 1. After this merge, we have (X-1) + 1 = X streams altogether. When the
number of streams is > X and < 2X-1, we let the code go to finalMerge, which
eventually calls
{code:title=Merger.java:645|borderStyle=solid}
if (numSegments <= factor) {
  // No extra merge needed
} else {
  // Do a merge of (number of map outputs) - X
}
{code}
So from my understanding it seems 2X-1 is the correct number. Please let me know if
you still think it's not.
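
To make the counting argument above concrete, here is a minimal sketch that only simulates the stream count. It is not the MergeManagerImpl code; the class and method names are hypothetical, and it assumes every map output lands on disk and that an intermediate merge of X streams fires as soon as 2X-1 outputs are present.

```java
class MergeThresholdSketch {
    /**
     * Simulates map outputs arriving on disk one at a time and returns how
     * many on-disk streams are left when fetching ends (the input to the
     * final merge). "factor" plays the role of io.sort.factor (X).
     */
    static int streamsLeftForFinalMerge(int mapOutputs, int factor) {
        int onDisk = 0;
        for (int i = 0; i < mapOutputs; i++) {
            onDisk++;
            if (onDisk >= 2 * factor - 1) {
                // Intermediate merge: X streams are replaced by 1 merged
                // stream, so (X-1) untouched + 1 merged = X remain.
                onDisk = onDisk - factor + 1;
            }
        }
        return onDisk;
    }

    public static void main(String[] args) {
        int x = 10;
        System.out.println(streamsLeftForFinalMerge(5, x));
        System.out.println(streamsLeftForFinalMerge(19, x));
        System.out.println(streamsLeftForFinalMerge(25, x));
    }
}
```

With X = 10, 19 outputs leave exactly 10 streams for the final merge, while 25 outputs leave 16, which is the "> X and < 2X-1" case that the final merge has to handle with an extra pass.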

* Hmmm.. I didn't know you could give a TreeSet a partial ordering and still get sorted output.
The latest javadocs don't say anything, but I found a StackOverflow answer saying it used to be the
case in JDK 1.2. Do you know if that is still true?
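
One thing that can be checked independently of the JDK 1.2 question is how a TreeSet treats elements that its comparator cannot tell apart. The sketch below uses a hypothetical Output class (a path plus a size) as a stand-in; it is not the type MergeManagerImpl actually stores, and the two comparators are only illustrations.

```java
import java.util.Comparator;
import java.util.TreeSet;

class TreeSetOrderingSketch {
    // Hypothetical stand-in for an on-disk map output.
    static final class Output {
        final String path;
        final long size;
        Output(String path, long size) { this.path = path; this.size = size; }
    }

    // Orders by size only: two outputs of equal size compare as 0.
    static TreeSet<Output> bySizeOnly() {
        return new TreeSet<>(Comparator.comparingLong((Output o) -> o.size));
    }

    // Orders by size, then breaks ties on the path.
    static TreeSet<Output> bySizeThenPath() {
        return new TreeSet<>(Comparator.comparingLong((Output o) -> o.size)
                .thenComparing((Output o) -> o.path));
    }

    // Adds two distinct outputs of the same size and reports how many the set kept.
    static int sizeAfterTie(TreeSet<Output> set) {
        set.add(new Output("map_0.out", 100L));
        set.add(new Output("map_1.out", 100L));
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeAfterTie(bySizeOnly()));     // 1: the tie is treated as a duplicate
        System.out.println(sizeAfterTie(bySizeThenPath())); // 2: the tie-breaker keeps both
    }
}
```

The iteration order follows the comparator in both cases, but a TreeSet silently drops any element whose comparison with an existing element returns 0, so a comparator that leaves ties unresolved can lose entries.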

* Unfortunately I haven't run any performance tests. Hopefully we will get the fix onto our clusters
soon, and if we see incredible improvements in performance, I'll try to remember to report
back here. This should probably help on large jobs with a lot of maps, and we do have quite
a few of those :-)
> There are some bugs in implementation of MergeManager
> -----------------------------------------------------
>                 Key: MAPREDUCE-3685
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3685
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: anty.rao
>            Assignee: anty
>            Priority: Critical
>             Fix For: 0.23.7, 2.0.5-beta
>         Attachments: MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch,
MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch,
MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch,
MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch,
MAPREDUCE-3685.patch, MAPREDUCE-3685.patch
