Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-issues@hadoop.apache.org
Date: Fri, 30 Jan 2015 19:06:37 +0000 (UTC)
From: "Chris Nauroth (JIRA)" <jira@apache.org>
To: common-issues@hadoop.apache.org
Message-ID: <JIRA.12753788.1415389733000.217270.1422644797610@Atlassian.JIRA>
In-Reply-To: <JIRA.12753788.1415389733000@Atlassian.JIRA>
References: <JIRA.12753788.1415389733000@Atlassian.JIRA>
 <JIRA.12753788.1415389733254@arcas>
Subject: [jira] [Resolved] (HADOOP-11281) Add flag to fs.shell to skip
 _COPYING_ file
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HADOOP-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth resolved HADOOP-11281.
------------------------------------
    Resolution: Duplicate

> Add flag to fs.shell to skip _COPYING_ file
> -------------------------------------------
>
>                 Key: HADOOP-11281
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11281
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs, fs/s3
>         Environment: Hadoop 2.2, but the behavior is present in all versions.
> AWS EMR 3.0.4
>            Reporter: Corby Wilson
>            Priority: Critical
>
> Amazon S3 does not have a rename feature.
> When you use the Hadoop shell or distcp, Hadoop first uploads the file under a temporary name with the ._COPYING_ suffix, then renames it to the final target name.
> Code:
> org/apache/hadoop/fs/shell/CommandWithDestination.java
>       PathData tempTarget = target.suffix("._COPYING_");
>       targetFs.setWriteChecksum(writeChecksum);
>       targetFs.writeStreamToFile(in, tempTarget, lazyPersist);
>       targetFs.rename(tempTarget, target);
> The problem is that on S3 the rename is not a cheap metadata operation: we actually have to download the file again (through an InputStream), then upload it again under the final name.
> For very large files (>= 5 GB) we have to use multipart upload.
> So if we are processing several TB of multi-GB files, we are actually writing each file to S3 twice and reading it back from S3 once.
> It would be nice to have a flag or core-site.xml setting that allowed us to tell Hadoop to skip the temporary copy and just write the file once to its final name.
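
The requested behavior can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Hadoop implementation: it uses java.nio.file on a local filesystem as a stand-in for the target filesystem, and the class name DirectWriteSketch, the method copyStreamToTarget, and the skipTempFile parameter are all hypothetical names for the flag the reporter asks for. Only the ._COPYING_ suffix comes from the snippet quoted above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class DirectWriteSketch {
    // Suffix taken from the CommandWithDestination snippet quoted in the issue.
    static final String TEMP_SUFFIX = "._COPYING_";

    /**
     * Copies the stream to the target path.
     * skipTempFile is a hypothetical flag: when true, the data is written
     * once, directly to the final name; when false, it is written to a
     * temporary sibling and then renamed, mirroring the quoted snippet.
     */
    public static void copyStreamToTarget(InputStream in, Path target, boolean skipTempFile)
            throws IOException {
        if (skipTempFile) {
            // Single write, no rename. On a store without a native rename
            // this avoids the extra download and upload described above.
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            return;
        }
        // Default path: write to "<name>._COPYING_", then rename to "<name>".
        Path tempTarget = target.resolveSibling(target.getFileName() + TEMP_SUFFIX);
        Files.copy(in, tempTarget, StandardCopyOption.REPLACE_EXISTING);
        Files.move(tempTarget, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("copy-sketch");
        byte[] data = "example payload".getBytes(StandardCharsets.UTF_8);

        // Temp-file-then-rename, as in the quoted code.
        copyStreamToTarget(new ByteArrayInputStream(data), dir.resolve("a.txt"), false);
        // Direct write, as the issue requests.
        copyStreamToTarget(new ByteArrayInputStream(data), dir.resolve("b.txt"), true);

        System.out.println(Files.exists(dir.resolve("a.txt")));
        System.out.println(Files.exists(dir.resolve("b.txt")));
    }
}
```

The trade-off in this sketch is that the direct write gives up the property that a partially written file never appears under its final name; on a local filesystem the rename is cheap, so the temporary file costs little there, which is why the flag would matter mainly for stores like S3.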


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)