hadoop-common-issues mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-13975) Allow DistCp to use MultiThreadedMapper
Date Wed, 11 Jan 2017 23:26:16 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15819514#comment-15819514 ]

Zheng Shao edited comment on HADOOP-13975 at 1/11/17 11:26 PM:
---------------------------------------------------------------

Example usage:

bin/hadoop distcp -Dmapreduce.job.user.classpath.first=true -prbugp -m 8 -numThreadsPerMap 16 hdfs://sourcehdfs/srcdir hdfs://targethdfs/

This uses 8 mappers, each with 16 threads, to run distcp.  It is equivalent to running
128 mappers, except that it is more efficient (as long as we don't hit a resource bottleneck
on a single machine).
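
As a minimal sketch of the arithmetic above: the helper names total_copy_threads and distcp_cmd are hypothetical and exist only for illustration. The second helper prints a command line in the same shape as the example; it does not run distcp, and -numThreadsPerMap is available only with the patch attached to this issue.

```shell
#!/bin/sh
# total_copy_threads MAPS THREADS_PER_MAP
# Prints the number of copy threads running at once across the whole job.
total_copy_threads() {
  echo $(( $1 * $2 ))
}

# distcp_cmd MAPS THREADS_PER_MAP SRC DST
# Prints (does not execute) a distcp command line in the shape used above.
# -numThreadsPerMap is the option added by the HADOOP-13975 patch.
distcp_cmd() {
  echo "bin/hadoop distcp -prbugp -m $1 -numThreadsPerMap $2 $3 $4"
}

total_copy_threads 8 16
distcp_cmd 8 16 hdfs://sourcehdfs/srcdir hdfs://targethdfs/
```

Keeping the two numbers as separate arguments makes it easy to trade mappers for threads while holding the product constant.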

Note that "-Dmapreduce.job.user.classpath.first=true" is needed if you have updated only the
client-side hadoop-tools jar and not yet the server (YARN NodeManager) side.



was (Author: zshao):
Example usage:

bin/hadoop distcp -Dmapreduce.job.user.classpath.first=true -prbugp -m 2 -numThreadsPerMap 16 hdfs://sourcehdfs/srcdir hdfs://targethdfs/

Note that "-Dmapreduce.job.user.classpath.first=true" is needed if you have updated only the
client-side hadoop-tools jar and not yet the server (YARN NodeManager) side.


> Allow DistCp to use MultiThreadedMapper
> ---------------------------------------
>
>                 Key: HADOOP-13975
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13975
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: tools/distcp
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Minor
>         Attachments: HADOOP-distcp-multithreaded-mapper-branch26.1.patch, HADOOP-distcp-multithreaded-mapper-trunk.1.patch
>
>
> Although distcp allows users to control parallelism via the number of mappers, it is
> sometimes desirable to run fewer mappers with more threads per mapper.  Since distcp is
> network-bound (by throughput or, more frequently, by the latency of creating connections,
> opening files, reading/writing files, and closing files), this can make each mapper much
> more efficient.
> In that way, many resources can be shared, saving memory and connections to the
> NameNode.
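
The trade-off described in the issue can be sketched with a rough back-of-the-envelope model. Everything here is an assumption made for illustration: the helper names est_wall_ms and container_mem_mb are hypothetical, every file is assumed to cost the same fixed latency, each worker thread handles one file at a time, and each mapper is assumed to occupy one container of a fixed size. Real jobs will deviate from this, and a mapper running many threads may need a larger container than a single-threaded one.

```shell
#!/bin/sh
# est_wall_ms FILES LATENCY_MS MAPS THREADS_PER_MAP
# Rough wall-clock estimate for a purely latency-bound copy:
# files are processed in rounds of (MAPS * THREADS_PER_MAP) at a time.
est_wall_ms() {
  ew_workers=$(( $3 * $4 ))
  ew_rounds=$(( ($1 + ew_workers - 1) / ew_workers ))  # ceiling division
  echo $(( ew_rounds * $2 ))
}

# container_mem_mb MAPS MB_PER_CONTAINER
# Total container memory, assuming one container per mapper.
container_mem_mb() {
  echo $(( $1 * $2 ))
}

# Illustrative inputs only: 1280 files, 200 ms per file, 1024 MB containers.
echo "128 mappers x 1 thread:  $(est_wall_ms 1280 200 128 1) ms, $(container_mem_mb 128 1024) MB"
echo "8 mappers x 16 threads:  $(est_wall_ms 1280 200 8 16) ms, $(container_mem_mb 8 1024) MB"
```

Under these assumptions both layouts finish in the same estimated time, while the 8-mapper layout requests one sixteenth of the container memory.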



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


