hadoop-mapreduce-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1059) distcp can generate uneven map task assignments
Date Mon, 05 Oct 2009 23:51:31 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762452#action_12762452 ]

Aaron Kimball commented on MAPREDUCE-1059:

The problem is most apparent when a large number of small files are to be
transferred, interspersed with a few larger files. SequenceFile sync points are not placed
at the correct locations in the transfer list file. Consequently, many mappers have
no work to do, because their transfer lists are subsumed by adjacent map tasks in the file.
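
To illustrate the effect, here is a rough sketch in Python of a simplified model of sync-aligned reading. It is not the actual Hadoop SequenceFile reader or distcp code. The function name, the offsets, and the rule that a reader for the span [start, end) skips to the first sync marker at or after start and stops at the first sync marker at or after end are all assumptions made for illustration.

```python
from bisect import bisect_left


def records_per_split(record_offsets, sync_offsets, split_bounds):
    """Count the records each split reads under a simplified sync model.

    Assumption (illustrative only): a reader for the split [start, end)
    skips forward to the first sync marker at or after `start` and stops
    at the first sync marker at or after `end` (or at end of file).
    """
    syncs = sorted(sync_offsets)
    end_of_file = float("inf")
    counts = []
    for start, end in split_bounds:
        i = bisect_left(syncs, start)
        j = bisect_left(syncs, end)
        read_from = syncs[i] if i < len(syncs) else end_of_file
        read_to = syncs[j] if j < len(syncs) else end_of_file
        counts.append(sum(1 for off in record_offsets if read_from <= off < read_to))
    return counts


# Eight list entries, four splits of equal byte length (made-up offsets).
records = [0, 100, 200, 300, 400, 500, 600, 700]
splits = [(0, 200), (200, 400), (400, 600), (600, 800)]

# Sync markers at only two positions: half of the splits read nothing.
sparse = records_per_split(records, [0, 400], splits)

# Sync markers at every split boundary: each split reads its own entries.
aligned = records_per_split(records, [0, 200, 400, 600], splits)
```

In this model a split that contains no sync marker reads nothing, and its entries are read by the neighbouring split that started at an earlier sync marker.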

This patch causes the transfer list file to be rewritten as part of the splitting process,
with sync() points inserted at the correct boundaries between map tasks.
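
As a minimal sketch of the idea only (this is not the algorithm in the attached patch), the following Python groups list entries greedily by accumulated size and reports the entry indices before which a sync point would be written when the list is rewritten. The function name, the ceiling-division target, and the greedy rule are hypothetical choices for illustration.

```python
def plan_sync_boundaries(file_sizes, num_maps):
    """Return entry indices before which a sync point would be written.

    Hypothetical greedy rule: close the current group once its accumulated
    size reaches total_size / num_maps (rounded up). No boundary is placed
    after the last entry, so no trailing empty group is created.
    """
    if num_maps < 1:
        raise ValueError("num_maps must be at least 1")
    total = sum(file_sizes)
    target = max(1, -(-total // num_maps))  # ceiling division
    boundaries = []
    accumulated = 0
    for index, size in enumerate(file_sizes):
        accumulated += size
        if accumulated >= target and index < len(file_sizes) - 1:
            boundaries.append(index + 1)
            accumulated = 0
    return boundaries


# Equal-sized files: three boundaries, so four groups of two files each.
even = plan_sync_boundaries([10] * 8, 4)

# Small files interspersed with large ones (made-up sizes): one boundary,
# so only two non-empty groups even though four maps were requested.
skewed = plan_sync_boundaries([1, 1, 1, 1, 100, 1, 1, 1, 1, 100], 4)
```

Under this rule every group contains at least one file, so a split aligned to these boundaries is never empty. The number of groups may be smaller than the number of maps requested when a few large files dominate the total.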

This patch also includes tests of the split algorithm which apply the splitting process to
several different sets of file sizes. The old process can generate lists of splits in which
up to 75% of the mappers do no work. The new process guarantees a better distribution of files
over mappers.

> distcp can generate uneven map task assignments
> -----------------------------------------------
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.patch
> distcp writes out a SequenceFile containing the source files to transfer, and their sizes.
> Map tasks are created over spans of this file, representing the files which each mapper should
> transfer. In practice, some transfer loads yield many empty map tasks, while a few tasks perform
> the bulk of the work.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
