hadoop-common-dev mailing list archives

From "Ravi Gummadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5572) The map progress value should have a separate phase for doing the final sort.
Date Wed, 15 Apr 2009 14:51:14 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699224#action_12699224 ]

Ravi Gummadi commented on HADOOP-5572:

We are planning to allocate 33% of the map task's progress to the final sort.

Merge progress is currently not updated on either the map side or the reduce side. So even
if we allocate 33% of map task progress to the sort (merge), map progress will stay stuck at
66.7% until the sort (merge) finishes and will then jump from 66.7% to 100%. This could affect
speculative execution.
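
To make the 66.7% figure concrete, here is a minimal sketch of the weighting, assuming a 2/3 : 1/3 split between the map phase and the final sort. The class and method names are hypothetical and are not Hadoop APIs.

```java
/** Hypothetical illustration of phase weighting; not a Hadoop class. */
final class MapProgressSketch {
    // Assumed split: two thirds for the map phase, one third for the final sort.
    static final double MAP_WEIGHT = 2.0 / 3.0;
    static final double SORT_WEIGHT = 1.0 / 3.0;

    private MapProgressSketch() {}

    /** Combines per-phase progress values (each in [0, 1]) into overall task progress. */
    static double overall(double mapPhase, double sortPhase) {
        return MAP_WEIGHT * mapPhase + SORT_WEIGHT * sortPhase;
    }
}
```

If the sort phase never reports progress, overall(1.0, 0.0) stays at about 0.667 until the sort completes, and only then does overall(1.0, 1.0) report 1.0.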

Here is a proposal for updating sort/merge progress approximately.

In merge(), each merge takes the io.sort.factor smallest files. Assuming that there is no
combiner, we calculate the denominator for mergeProgress as follows before the merges
begin:

We maintain a sorted list of the sizes of the segments to be merged. We sum the sizes of the
io.sort.factor smallest segments (the ones that are going to be merged first), remove those
sizes from the list, and add the sum to the list. We repeat this until only 1 element is left
in the list. This element is the denominator for mergeProgress for the 1st merge.
As the segments are read for a merge, the numerator is incremented based on the position
in each segment, and mergeProgress is updated.
The denominator is decreased by the difference (inputRecordsForThisMerge - mergedRecordsInThisMerge).
This gives a better approximation of mergeProgress when the combiner is called during merges.
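
The bookkeeping above could be sketched roughly as follows. This is one reading of the proposal, not the actual patch: all names are hypothetical, sizes and records are treated as the same abstract unit, and the denominator is taken to be the amount read across all merge passes (taken literally, the single element left in the list is just the sum of the input sizes).

```java
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of the proposed merge progress tracking; not a Hadoop API. */
final class MergeProgressSketch {
    private long done;      // numerator: units consumed so far
    private long expected;  // denominator: estimated units to consume in total

    MergeProgressSketch(long expectedTotal) {
        this.expected = expectedTotal;
    }

    /**
     * Simulates the merges with no combiner: repeatedly replaces the `factor`
     * smallest sizes with their sum until one size is left, and returns the
     * amount read summed over all simulated merges (an assumed interpretation).
     */
    static long estimateTotal(List<Long> segmentSizes, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        PriorityQueue<Long> sizes = new PriorityQueue<>();
        sizes.addAll(segmentSizes);
        long total = 0;
        while (sizes.size() > 1) {
            int count = Math.min(factor, sizes.size());
            long merged = 0;
            for (int i = 0; i < count; i++) {
                merged += sizes.poll();   // smallest remaining segment
            }
            total += merged;              // this merge reads `merged` units
            sizes.add(merged);            // its output becomes a new segment
        }
        return total;
    }

    /** Called as segments are read; delta is the distance advanced in a segment. */
    void advance(long delta) {
        done += delta;
    }

    /** Shrinks the denominator when a combiner made the merge output smaller. */
    void mergeFinished(long input, long output) {
        expected -= (input - output);
    }

    /** Current mergeProgress in [0, 1]. */
    double progress() {
        if (expected <= 0) {
            return 1.0;
        }
        return Math.min(1.0, (double) done / expected);
    }
}
```

For segment sizes 10, 20, 30, 40 and a factor of 2, estimateTotal simulates merges of 30, 60, and 100 units and returns 190.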

With the above approach, mergeProgress is not very accurate when a combiner is used in merges,
for 2 reasons:
(1) The total size of the data to be merged across all the merges cannot be estimated exactly
before the merges when a combiner is present.
(2) Sizes of compressed and uncompressed segments (in-memory segments) are treated alike.

This would also avoid the jump in reduce task progress from 33.3% to 66.7%. On the reduce side,
when computing the denominator for mergeProgress from the list of segment sizes, we must not
add the sizes of the segments in the last merge (of factor segments) to the estimated total
size of the data to be merged, because the last merge is considered part of the 3rd phase of
the reduce task (i.e., the reduce phase).
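
A minimal sketch of that reduce-side variant, under the same assumptions as the map-side description (no combiner, each merge takes the io.sort.factor smallest segments). The class and method names are hypothetical, and treating the last merge as "the one that leaves a single segment" is an assumption about the intent.

```java
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical reduce-side estimate that leaves out the last merge; not a Hadoop API. */
final class ReduceMergeEstimateSketch {
    private ReduceMergeEstimateSketch() {}

    /**
     * Sums the amount read by every simulated merge except the last one, which
     * is assumed to be accounted for in the reduce phase rather than the sort phase.
     */
    static long estimateTotalExcludingLastMerge(List<Long> segmentSizes, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        PriorityQueue<Long> sizes = new PriorityQueue<>();
        sizes.addAll(segmentSizes);
        long total = 0;
        // Stop once the remaining segments fit into a single (last) merge.
        while (sizes.size() > factor) {
            long merged = 0;
            for (int i = 0; i < factor; i++) {
                merged += sizes.poll();   // smallest remaining segment
            }
            total += merged;
            sizes.add(merged);
        }
        return total;
    }
}
```

For segment sizes 10, 20, 30, 40 and a factor of 2, the simulated merges of 30 and 60 units are counted and the last merge of 100 units is left out, so the estimate is 90.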

Thoughts?

> The map progress value should have a separate phase for doing the final sort.
> -----------------------------------------------------------------------------
>                 Key: HADOOP-5572
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5572
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Ravi Gummadi
> Currently, the final spill and sort doesn't record any progress while it runs, leading
> to the perception that the map is done, but "stuck".

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
