hadoop-common-user mailing list archives

From Chris Nauroth <cnaur...@hortonworks.com>
Subject Re: Using DistCp and S3AFileSystem to move data to S3
Date Mon, 16 May 2016 16:58:33 GMT
Hello Elliot,

This is very timely as I have been investigating this recently.  Your assessment is correct:
DistCp triggers a rename, and renames on S3 do not satisfy the expectation that rename is
fast and atomic like on most file systems.
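
To illustrate why that matters, here is a minimal sketch in plain Java of a rename emulated
as copy + delete over a flat key space. It is a toy in-memory model, not the S3AFileSystem
code; the class name, the method names and the stopAfter parameter are all made up for the
example. It shows that the work grows with the bytes moved, and that an interrupted rename
leaves data at both the source and the destination.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy in-memory model of an object store. Hypothetical; NOT the S3A implementation. */
final class ObjectStoreRenameSketch {
    // Flat key -> bytes map: an object store has keys, not real directories.
    private final TreeMap<String, byte[]> objects = new TreeMap<>();
    private long bytesCopied = 0;

    void put(String key, byte[] data) { objects.put(key, data.clone()); }

    boolean exists(String key) { return objects.containsKey(key); }

    long bytesCopied() { return bytesCopied; }

    /**
     * Emulates rename(srcPrefix, dstPrefix) as per-object copy followed by delete.
     * stopAfter is a hypothetical knob that simulates a task dying mid-rename.
     * Returns the number of objects actually moved.
     */
    int rename(String srcPrefix, String dstPrefix, int stopAfter) {
        // Snapshot the matching keys first so the map can be modified safely.
        List<String> keys = new ArrayList<>();
        for (String key : objects.keySet()) {
            if (key.startsWith(srcPrefix)) {
                keys.add(key);
            }
        }
        int moved = 0;
        for (String key : keys) {
            if (moved >= stopAfter) {
                break; // simulated failure: the remaining objects stay at the source
            }
            byte[] data = objects.get(key);
            String target = dstPrefix + key.substring(srcPrefix.length());
            objects.put(target, data.clone()); // copy: cost is proportional to the data size
            bytesCopied += data.length;
            objects.remove(key);               // delete: a separate step, so no atomicity
            moved++;
        }
        return moved;
    }

    public static void main(String[] args) {
        ObjectStoreRenameSketch store = new ObjectStoreRenameSketch();
        store.put("tmp/a", new byte[10]);
        store.put("tmp/b", new byte[20]);
        int moved = store.rename("tmp/", "out/", 1); // interrupted after one object
        System.out.println("moved=" + moved
            + " out/a=" + store.exists("out/a")
            + " tmp/b=" + store.exists("tmp/b")
            + " bytesCopied=" + store.bytesCopied());
    }
}
```

A file system with a real rename would update one directory entry instead, which is why
callers such as DistCp can normally treat the operation as cheap and all-or-nothing.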

There has been prior discussion of "direct commit" strategies like you described to improve
performance against S3A.  The relevant JIRAs are HADOOP-9565 and HADOOP-11487.  I recommend
watching those JIRAs if you'd like to keep track of how the discussion evolves.

Meanwhile, you might be interested in my work-in-progress patch on HADOOP-13145, which prevents
some unnecessary calls in DistCp when you're not using the option to preserve metadata attributes.
 This does not directly address the rename/copy problem, but it does avoid a potential eventual
consistency problem with DistCp to S3A and provide an overall optimization.  We are seeing
good results with the patch so far from some manual DistCp testing.  I still need to write
some JUnit tests before we'll commit that patch.

--Chris Nauroth

From: Elliot West <teabot@gmail.com>
Date: Monday, May 16, 2016 at 7:11 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Using DistCp and S3AFileSystem to move data to S3


I've been moving files to S3 using DistCp and the S3AFileSystem (branch-2.8) and notice that
DistCp always copies to a temporary set of files in S3 and then performs a move on copy completion.
It does this on a per-task basis, separately from the temporary location used by the '-atomic'
option. An example path is as follows:


Now, my understanding is that moves on S3 are actually an asynchronous copy + delete, and
that once the call to FileSystem.rename(...) returns there is no guarantee that the data is
at the destination at that point in time. Therefore I can make no guarantees regarding the
availability of said data to downstream processes that may wish to consume it. However, I
am led to believe that file creations are consistent (but not overwrites).

Is there any way to have DistCp write directly to the target location in S3? If not, is there
any reason why it would be undesirable to provide the option of such behaviour?

The code in question is located here:


