Date: Sun, 6 Dec 2015 19:56:11 +0000 (UTC)
From: "Yongjun Zhang (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel

[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044096#comment-15044096 ]

Yongjun Zhang commented on HADOOP-11794:
----------------------------------------

Hi [~mithun], thanks for your earlier work here. Wonder if you will continue to work on this issue? If not, I'm interested in taking it on. Thanks.

> distcp can copy blocks in parallel
> ----------------------------------
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 0.21.0
> Reporter: dhruba borthakur
> Assignee: Mithun Radhakrishnan
> Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are larger than 1 TB with a block size of 1 GB. If we use distcp to copy these files, the tasks either take a very long time or eventually fail. A better approach would be for distcp to copy all the source blocks in parallel, and then stitch the blocks back into files at the destination via the HDFS Concat API (HDFS-222).
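
For readers who want to see the shape of the proposal in the issue description, here is a minimal sketch of "copy block-sized ranges in parallel, then stitch them in order". It is not distcp code and does not use any Hadoop API: local files stand in for HDFS, a thread pool stands in for map tasks, and the final append loop is only a stand-in for what the HDFS Concat API would do at the destination. All function names (`split_into_ranges`, `copy_range`, `parallel_copy`) are hypothetical and chosen for illustration.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


def split_into_ranges(file_size, block_size):
    """Return (offset, length) pairs covering the file, one per block."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]


def copy_range(src, part_path, offset, length):
    """Copy one byte range of src into its own part file."""
    with open(src, "rb") as fin, open(part_path, "wb") as fout:
        fin.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = fin.read(min(64 * 1024, remaining))
            if not chunk:
                break  # source shorter than expected; stop early
            fout.write(chunk)
            remaining -= len(chunk)
    return part_path


def parallel_copy(src, dst, block_size, workers=4):
    """Copy src to dst one block-sized range per task, then stitch the parts."""
    ranges = split_into_ranges(os.path.getsize(src), block_size)
    dst_dir = os.path.dirname(os.path.abspath(dst))
    part_dir = tempfile.mkdtemp(prefix="parts-", dir=dst_dir)
    parts = [os.path.join(part_dir, "part-%05d" % i)
             for i in range(len(ranges))]

    # Each range is an independent unit of work, so one slow or failed
    # range does not force the whole file to be copied again.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(copy_range, src, part, off, length)
                   for part, (off, length) in zip(parts, ranges)]
        for future in futures:
            future.result()  # re-raise any exception from a worker

    # Stitch the parts back together in offset order. This append loop is
    # an assumption standing in for a concat operation at the destination.
    with open(dst, "wb") as out:
        for part in parts:
            with open(part, "rb") as fin:
                shutil.copyfileobj(fin, out)
            os.remove(part)
    os.rmdir(part_dir)
    return len(parts)
```

The point of the design is that the unit of work shrinks from a whole file to a single block-sized range, so a 1 TB file no longer has to be handled by a single task.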