hadoop-mapreduce-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-1059) distcp can generate uneven map task assignments
Date Fri, 18 Dec 2009 03:26:18 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Aaron Kimball updated MAPREDUCE-1059:

    Attachment: MAPREDUCE-1059.3.patch

I realized that the problem is much simpler than I was making it out to be. The split points
inserted into the existing file are placed at positions that have nothing to do with the
user's chosen target map task size: the code always uses the hardcoded constant MAX_BYTES_PER_MAP.
This patch changes it to query the configuration for distcp.bytes.per.map instead.

To verify this works, run a distcp job transferring 100 files of 1 KB each:
{{hadoop distcp -Ddistcp.bytes.per.map=100 srcpath dstpath}}.

On a pseudo-distributed cluster, Hadoop will generate 20 tasks (the cap of 20 tasks/node
limits the target number of tasks). Without this patch, ten of these tasks transfer files and
ten transfer none, because their splits contain no sync points. With this patch, all 20 tasks
transfer 5 files each.
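
The effect in the scenario above can be reproduced with a small toy model. This is only a sketch, not the distcp code: the function names are hypothetical, the 256 MB value is a stand-in for MAX_BYTES_PER_MAP, and the rule that a sync point is also written at least every 10 entries is an assumption chosen so the model reproduces the ten/ten outcome described above. It also assumes a task starts reading at the first sync point at or after its split start and stops at the first sync point at or after its split end.

```python
# Toy model of sync-point placement vs. split boundaries (not the distcp code).

HARDCODED_BYTES_PER_MAP = 256 * 1024 * 1024  # assumed stand-in for MAX_BYTES_PER_MAP


def sync_positions(file_sizes, bytes_per_sync, max_files_per_sync=10):
    """Return the record indices at which a sync point sits.

    A sync point is placed after a record once the bytes or the entry count
    accumulated since the previous sync point reach their limit. Index 0
    (start of the list) always counts as a sync point.
    max_files_per_sync=10 is an assumption of this model.
    """
    syncs = [0]
    acc_bytes = 0
    acc_files = 0
    for i, size in enumerate(file_sizes):
        acc_bytes += size
        acc_files += 1
        is_last = i + 1 == len(file_sizes)
        if not is_last and (acc_bytes >= bytes_per_sync
                            or acc_files >= max_files_per_sync):
            syncs.append(i + 1)
            acc_bytes = 0
            acc_files = 0
    return syncs


def files_per_task(file_sizes, num_tasks, bytes_per_sync):
    """Count the records each of num_tasks equal-sized splits ends up reading."""
    n = len(file_sizes)
    # Treat end-of-list as a final boundary so every position can be aligned.
    syncs = sync_positions(file_sizes, bytes_per_sync) + [n]

    def align(pos):
        # First sync point at or after pos.
        return next(s for s in syncs if s >= pos)

    counts = []
    for t in range(num_tasks):
        start = t * n // num_tasks
        end = (t + 1) * n // num_tasks
        counts.append(align(end) - align(start))
    return counts


if __name__ == "__main__":
    sizes = [1024] * 100  # 100 files of 1 KB
    print("hardcoded constant:", files_per_task(sizes, 20, HARDCODED_BYTES_PER_MAP))
    print("bytes.per.map=100: ", files_per_task(sizes, 20, 100))
```

In this model, the hardcoded threshold leaves a sync point only every 10 entries, so every other 5-entry split finds no sync point inside it and reads nothing. With a 100-byte threshold, a sync point follows every entry and each split reads exactly its own 5 entries.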

> distcp can generate uneven map task assignments
> -----------------------------------------------
>                 Key: MAPREDUCE-1059
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1059
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1059.2.patch, MAPREDUCE-1059.3.patch, MAPREDUCE-1059.patch
> distcp writes out a SequenceFile containing the source files to transfer and their sizes.
> Map tasks are created over spans of this file, representing the files each mapper should
> transfer. In practice, some transfer loads yield many empty map tasks, while a few tasks
> perform the bulk of the work.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
