hadoop-common-dev mailing list archives

From "Corby Wilson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-11281) Add flag to fs.shell to skip _COPYING_ file
Date Fri, 07 Nov 2014 19:49:34 GMT
Corby Wilson created HADOOP-11281:
-------------------------------------

             Summary: Add flag to fs.shell to skip _COPYING_ file
                 Key: HADOOP-11281
                 URL: https://issues.apache.org/jira/browse/HADOOP-11281
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs, fs/s3
         Environment: Hadoop 2.2, but the issue is present in all versions.
AWS EMR 3.0.4
            Reporter: Corby Wilson
            Priority: Critical


Amazon S3 does not have a native rename operation.
When you use the Hadoop shell or distcp, Hadoop first uploads the file under a temporary
name with the ._COPYING_ suffix, then renames it to the final output path.

Code (org/apache/hadoop/fs/shell/CommandWithDestination.java):
      // Upload to a temporary "._COPYING_" name, then rename to the target.
      PathData tempTarget = target.suffix("._COPYING_");
      targetFs.setWriteChecksum(writeChecksum);
      targetFs.writeStreamToFile(in, tempTarget, lazyPersist);
      // On S3 this rename is not a metadata operation: the object gets copied again.
      targetFs.rename(tempTarget, target);

The problem is that, because S3 cannot rename in place, the rename step actually reads the
file back (through an InputStream) and uploads it again.
For very large files (>= 5 GB) this second upload must use multipart upload.
So when processing several TB of multi-GB files, we end up writing each file to S3 twice
and reading it from S3 once.
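A back-of-the-envelope sketch of the extra traffic, using the 2x-write / 1x-read pattern described above (the class and method names here are illustrative, not part of Hadoop):

```java
// Sketch: estimate total S3 traffic for the current upload-then-rename path.
// Per the description above: one upload to ._COPYING_, then the rename
// re-reads the object and uploads it a second time.
public class S3TransferEstimate {
    static long bytesWritten(long totalBytes) { return 2 * totalBytes; }
    static long bytesRead(long totalBytes)    { return totalBytes; }

    public static void main(String[] args) {
        long input = 3L * 1024 * 1024 * 1024 * 1024; // 3 TB of source data
        System.out.println("written=" + bytesWritten(input)
                + " read=" + bytesRead(input));
    }
}
```

With a direct write, the same 3 TB job would transfer only the initial upload.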

It would be nice to have a command-line flag or core-site.xml setting that tells Hadoop to
skip the temporary ._COPYING_ file and write the target file directly, once.
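A core-site.xml entry for such a switch might look like the following (the property name fs.shell.direct.write is a hypothetical example for illustration, not an existing Hadoop setting):

```xml
<property>
  <!-- Hypothetical property: write directly to the target path,
       skipping the ._COPYING_ temporary file and the rename. -->
  <name>fs.shell.direct.write</name>
  <value>true</value>
</property>
```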



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
