hadoop-mapreduce-issues mailing list archives

From "anty.rao (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3685) There are some bugs in MergeManager.java :
Date Wed, 18 Jan 2012 06:38:39 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188304#comment-13188304 ]

anty.rao commented on MAPREDUCE-3685:
-------------------------------------

There are some bugs in MergeManager.java:
1) In the constructor of OnDiskMerger in MergeManager.java, at line 472:
[code]
   public OnDiskMerger(MergeManager<K, V> manager) {
     super(manager, Integer.MAX_VALUE, exceptionReporter);
     setName("OnDiskMerger - Thread to merge on-disk map-outputs");
     setDaemon(true);
   }
[code]
the second parameter, mergeFactor, must not be Integer.MAX_VALUE; it should be io.sort.factor.
With mergeFactor set to Integer.MAX_VALUE, the OnDiskMerger merges all files fed to it in a
single pass, ignoring the io.sort.factor parameter.
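As a rough illustration (plain Java, not the MergeManager code; passes() is a hypothetical helper), compare how many merge passes a bounded merge factor produces versus Integer.MAX_VALUE. With MAX_VALUE everything is merged in one pass, which also means every on-disk file is opened at once, exactly what io.sort.factor is meant to bound:

```java
// Hypothetical simulation, not Hadoop code: counts how many merge passes
// are needed to reduce `files` sorted runs to one, given a merge factor.
public class MergePassSim {
    static int passes(int files, int factor) {
        int passes = 0;
        while (files > 1) {
            // each pass merges up to `factor` runs into a single run
            files = files - Math.min(files, factor) + 1;
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        // factor = Integer.MAX_VALUE: one pass, all 100 files open at once
        System.out.println(passes(100, Integer.MAX_VALUE));
        // factor = 10 (a typical io.sort.factor): bounded fan-in per pass
        System.out.println(passes(100, 10));
    }
}
```

This is only a counting model; the real merger's pass planning is more involved, but the point stands: MAX_VALUE removes the fan-in bound entirely.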

2) Still in MergeManager.java, at line 90, the data structure
  [code]Set<Path> onDiskMapOutputs = new TreeSet<Path>();[code]
is incorrect: the on-disk files should be sorted by file length, not by the URI of the Path.
As a result, the files fed to OnDiskMerger are not ordered by length, which hurts overall performance.
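A small sketch of the ordering problem (plain Java strings standing in for Hadoop Paths; the file names and lengths are made up): a TreeSet with natural ordering sorts by name, while a TreeSet built with a length comparator yields the smallest-first order the merger wants:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustrative only: file name -> length in bytes (hypothetical map outputs)
public class OnDiskOrdering {
    static final Map<String, Long> LEN = Map.of(
        "file_0", 900L, "file_1", 10L, "file_2", 500L);

    // natural (lexicographic) ordering, as TreeSet<Path> effectively does
    static List<String> byName() {
        return new ArrayList<>(new TreeSet<>(LEN.keySet()));
    }

    // ordering by file length, with name as a tie-breaker so entries
    // with equal lengths are not silently dropped by the set
    static List<String> byLength() {
        TreeSet<String> s = new TreeSet<>(
            Comparator.comparingLong(LEN::get)
                      .thenComparing(Comparator.naturalOrder()));
        s.addAll(LEN.keySet());
        return new ArrayList<>(s);
    }

    public static void main(String[] args) {
        System.out.println(byName());   // [file_0, file_1, file_2]
        System.out.println(byLength()); // [file_1, file_2, file_0]
    }
}
```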

3) Still in MergeManager.java, starting at line 745:
[code]
    if (0 != onDiskBytes) {
      final int numInMemSegments = memDiskSegments.size();
      diskSegments.addAll(0, memDiskSegments);
      memDiskSegments.clear();
      // Pass mergePhase only if there is a going to be intermediate
      // merges. See comment where mergePhaseFinished is being set
      Progress thisPhase = (mergePhaseFinished) ? null : mergePhase;
      RawKeyValueIterator diskMerge = Merger.merge(
          job, fs, keyClass, valueClass, diskSegments,
          ioSortFactor, numInMemSegments, tmpDir, comparator,
          reporter, false, spilledRecordsCounter, null, thisPhase);
      diskSegments.clear();
      if (0 == finalSegments.size()) {
        return diskMerge;
      }
      finalSegments.add(new Segment<K,V>(
            new RawKVIteratorReader(diskMerge, onDiskBytes), true));
    }
[code]
The code above merges files down to io.sort.factor, which may involve intermediate merge
passes. The problem is that the codec parameter is not passed in, so the intermediate merges
will not compress the files written to disk, causing a severe performance degradation.
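To see why the missing codec matters, here is a standalone sketch using java.util.zip (not Hadoop's CompressionCodec API): shuffle data tends to be highly repetitive, so the uncompressed intermediate file is many times larger than the compressed one, and that extra volume is written and re-read on every intermediate merge pass:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Illustrative only: GZIP stands in for the job's intermediate codec.
public class IntermediateSpill {
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // fake, highly repetitive key/value records, as shuffle data often is
        byte[] raw = "key\tvalue\n".repeat(10_000).getBytes();
        byte[] packed = compress(raw);
        System.out.println(raw.length + " bytes raw -> "
                           + packed.length + " bytes compressed");
    }
}
```

Skipping compression on such data multiplies the bytes each intermediate merge pass must write and later re-read.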
                
> There are some bugs in MergeManager.java :
> ------------------------------------------
>
>                 Key: MAPREDUCE-3685
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3685
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: anty.rao
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
