hadoop-common-dev mailing list archives

From "Runping Qi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-591) Reducer sort should even out the pass factors in merging different pass
Date Wed, 11 Oct 2006 16:50:36 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-591?page=comments#action_12441494 ] 
Runping Qi commented on HADOOP-591:

Another strategy is to do one merge per pass. At the end of each pass, replace the just-merged
segments in the segment list with the new segment. Always choose the smallest segments in
the segment list to merge. Let T be the total number of segments and P the pass factor. Each
pass except the first merges P segments into one, shrinking the segment list by P - 1.
The number of segments merged in the first pass is P if (T-1) % (P-1) = 0, and
(T-1) % (P-1) + 1 otherwise, so that every later pass, including the last, merges exactly P segments.
This strategy will minimize the volume of data copied during the merge process.
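
The strategy above could be sketched roughly as follows. This is only an illustration in Python, not the SequenceFile code; the names first_pass_size and plan_merges are made up for this sketch, and segments are modelled as plain sizes rather than files.

```python
import heapq


def first_pass_size(total_segments, pass_factor):
    """Segments to merge in the first pass: P if (T-1) % (P-1) == 0,
    and (T-1) % (P-1) + 1 otherwise."""
    if pass_factor < 2:
        raise ValueError("pass factor must be at least 2")
    remainder = (total_segments - 1) % (pass_factor - 1)
    return pass_factor if remainder == 0 else remainder + 1


def plan_merges(segment_sizes, pass_factor):
    """Simulate one merge per pass, always taking the smallest segments.

    Returns (passes, copied): the list of segment sizes merged in each
    pass, and the total volume of data written across all passes.
    """
    heap = list(segment_sizes)
    heapq.heapify(heap)
    passes = []
    copied = 0
    width = first_pass_size(len(heap), pass_factor)
    while len(heap) > 1:
        width = min(width, len(heap))
        merged = [heapq.heappop(heap) for _ in range(width)]
        new_size = sum(merged)
        copied += new_size
        passes.append(merged)
        # The merged output replaces its inputs in the segment list.
        heapq.heappush(heap, new_size)
        # Every pass after the first uses the full pass factor.
        width = pass_factor
    return passes, copied
```

With T = 10 and P = 4, (T-1) % (P-1) is 0, so the sketch merges 4 segments in each of three passes (10 -> 7 -> 4 -> 1).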

For the example in the previous comment, the first pass merges the two smallest segments,
and the second pass then merges the new segment with the remaining 99 segments.
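
To make the difference concrete, here is a small Python sketch comparing the volume copied for that example. It assumes all 101 segments have the same size (1 unit each), which the issue does not state, and it models the current behaviour simply as "merge the first P segments in list order"; both function names are hypothetical.

```python
def greedy_copied(sizes, pass_factor):
    """Volume written when each pass merges the first pass_factor segments
    in list order (a simplified model of the current behaviour)."""
    segments = list(sizes)
    copied = 0
    while len(segments) > 1:
        merged = sum(segments[:pass_factor])
        copied += merged
        segments = [merged] + segments[pass_factor:]
    return copied


def smallest_first_copied(sizes, pass_factor):
    """Volume written when the first pass is sized by the (T-1) % (P-1) rule
    and every pass merges the smallest segments."""
    segments = sorted(sizes)
    remainder = (len(segments) - 1) % (pass_factor - 1)
    width = pass_factor if remainder == 0 else remainder + 1
    copied = 0
    while len(segments) > 1:
        merged = sum(segments[:width])
        copied += merged
        segments = sorted(segments[width:] + [merged])
        width = pass_factor
    return copied


sizes = [1] * 101  # assumption: equal-sized segments
print(greedy_copied(sizes, 100))          # 201
print(smallest_first_copied(sizes, 100))  # 103
```

Under this equal-size assumption the greedy order writes 100 + 101 = 201 units, while the smallest-first order writes 2 + 101 = 103.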

> Reducer sort should even out the pass factors in merging different pass
> -----------------------------------------------------------------------
>                 Key: HADOOP-591
>                 URL: http://issues.apache.org/jira/browse/HADOOP-591
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Runping Qi
> When multiple-pass merging is needed during sort, the current sort implementation in the
> SequenceFile class uses a simple "greedy" way to select pass factors, resulting in uneven pass
> factors across passes. For example, suppose the pass factor is 100 (the default) and there
> are 101 segments to be merged. The current implementation will first merge the first 100 segments
> into one and then merge the big output file with the last segment, using pass factor 2. It would
> be better to use pass factor 11 in the first pass and pass factor 10 in the second pass.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

