hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mariappan Asokan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3685) There are some bugs in implementation of MergeManager
Date Sat, 09 Mar 2013 05:23:15 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13597848#comment-13597848

Mariappan Asokan commented on MAPREDUCE-3685:

Hi Ravi,
  You are absolutely right that we want to make sure that in the final merge the number of
streams to merge is less than or equal to {{ioSortFactor.}}  I stand corrected on that.  So
I should have stated that the change is:
    if (onDiskMapOutputs.size() > ioSortFactor) {
Note that I changed ">=" to ">" from my previous suggestion.  The rationale for starting
the merge early(instead of waiting until 2*{{ioSortFactor}} - 1 disk files are created) is
to leverage additional overlapped processing.  The last merge will end up with merging less
than or equal to {{ioSortFactor}} disk files.

For example, suppose {{io.sort.factor}} is set to 100 and there are 198 disk files.  With
the code in your patch, all 198 files will be merged in the final merge with two sequential
merge passes one with 100 and the other with 99.  With my suggestion, the first 100 would
have been merged overlapped with the fetches and in-memory merges.  The last merge will merge
only 99 files.

By partial ordering, I did not mean unsorted order.  I meant that duplicates will be retained.
Perhaps, I confused you with a mathematical term.  Simply put, in a partially ordered set,
all elements
are related by "<=".  When you compare two elements in the set, you want to make sure that
the smaller element is collated first.  If one of them is greater than or equal to the other,
it is collated second.  The Java {{TreeSet}} implementation will keep the elements in sorted
order with this
{{Comparable}} implementation.  BTW, what you have in the patch will work.  What I suggested
has less code with no unnecessary comparisons.  For brevity, you can even code it as:
return((this.getCompressedSize() < compPath.getCompressedSize()) ? -1 : 1);

IMHO, any performance enhancement should not result in performance regression for any use
case.  I just wanted to make sure that your patch does not cause any performance regression
in some cases.  If you have some setup to run performance tests, please go ahead.  Otherwise,
just ignore my suggestion.

-- Asokan

> There are some bugs in implementation of MergeManager
> -----------------------------------------------------
>                 Key: MAPREDUCE-3685
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3685
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: anty.rao
>            Assignee: anty
>            Priority: Critical
>             Fix For: 0.23.7, 2.0.5-beta
>         Attachments: MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch,
MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch,
MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch,
MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch,
MAPREDUCE-3685.patch, MAPREDUCE-3685.patch

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message