hadoop-common-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-2920) Optimize the last merge of the map output files
Date Sat, 01 Mar 2008 09:25:51 GMT
Optimize the last merge of the map output files
-----------------------------------------------

                 Key: HADOOP-2920
                 URL: https://issues.apache.org/jira/browse/HADOOP-2920
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
            Reporter: Devaraj Das


In ReduceTask, today we merge io.sort.factor files at a time and write the result of each merge
back to disk. The last merge can probably be done better. For example, with io.sort.factor = 100
and io.sort.factor + 10 files at the end, today we merge 100 files into one and then return an
iterator over the remaining 11 files. This can be improved (in terms of disk I/O) by merging the
smallest 11 files and then returning an iterator over the 100 remaining files. Another option is
to not do any single-level merge at all when we have io.sort.factor + n files remaining
(where n << io.sort.factor) and just return the iterator directly. A sketch of the first option
follows below. Thoughts?
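
As a rough illustration of the first option (a minimal sketch, not the actual ReduceTask code;
the Segment class, field names, and sizes here are hypothetical): given N remaining segments and
a merge factor F, merge only the N - F + 1 smallest segments, so that exactly F streams are left
for the final reduce-side iterator and the re-read/re-write cost falls on the cheapest files.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class LastMergeSketch {

      /** Hypothetical stand-in for an on-disk map output segment. */
      static class Segment {
        final String path;
        final long length;
        Segment(String path, long length) { this.path = path; this.length = length; }
      }

      /**
       * Plans the final merge: if more than 'factor' segments remain, pick the
       * smallest (segments.size() - factor + 1) of them to merge into one file,
       * leaving exactly 'factor' streams for the reduce-side iterator.
       */
      static List<Segment> segmentsToMerge(List<Segment> segments, int factor) {
        if (segments.size() <= factor) {
          return new ArrayList<>();            // nothing to merge; iterate directly
        }
        List<Segment> sorted = new ArrayList<>(segments);
        sorted.sort(Comparator.comparingLong(s -> s.length));   // smallest first
        int numToMerge = segments.size() - factor + 1;
        return sorted.subList(0, numToMerge);  // cheapest segments to re-read and re-write
      }

      public static void main(String[] args) {
        // io.sort.factor + 10 files with io.sort.factor = 100, as in the example above.
        List<Segment> segs = new ArrayList<>();
        for (int i = 0; i < 110; i++) {
          segs.add(new Segment("output/spill" + i + ".out", 1_000_000L + i * 10_000L));
        }
        List<Segment> plan = segmentsToMerge(segs, 100);
        System.out.println("Merge " + plan.size() + " smallest segments; "
            + (segs.size() - plan.size() + 1) + " streams feed the reducer.");
      }
    }

With 110 segments and a factor of 100, this plans a merge of the 11 smallest files, after which
the merged output plus the 99 untouched (larger) files give exactly 100 streams for the iterator,
matching the example in the description.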

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

