hadoop-mapreduce-user mailing list archives

From Elliot West <tea...@gmail.com>
Subject Using DistCp and S3AFileSystem to move data to S3
Date Mon, 16 May 2016 14:11:07 GMT
Hello,

I've been moving files to S3 using DistCp and the S3AFileSystem
(branch-2.8) and have noticed that DistCp always copies to a temporary set
of files in S3 and then performs a move on copy completion. It does this on
a per-task basis, separately from the temporary location used by the
'-atomic' option. An example path is as follows:

s3://bucket/folder/.distcp.tmp.attempt_0000000000001_000001_m_000001_0


Now, my understanding is that a move on S3 is actually an asynchronous copy
+ delete, and that once the call to FileSystem.rename(...) returns there is
no guarantee that the data is present at the destination at that point in
time. Therefore I can make no guarantees regarding the availability of that
data to downstream processes that may wish to consume it. However, I am led
to believe that file creations are consistent (but overwrites are not).
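To make the concern concrete, here is a minimal toy model of the behaviour I
am describing. This is not S3 or Hadoop code; the store, its delay, and all
names are hypothetical stand-ins, assuming only that rename is implemented as
a copy + delete which can return before the destination copy is readable:

```python
import time

class ToyObjectStore:
    """Toy model (not S3/Hadoop): rename is a copy + delete that
    returns before the destination copy becomes readable."""

    def __init__(self, copy_delay=0.05):
        self.objects = {}  # key -> (data, visible_at timestamp)
        self.copy_delay = copy_delay

    def put(self, key, data):
        # Plain creates are immediately consistent in this model.
        self.objects[key] = (data, time.monotonic())

    def rename(self, src, dst):
        # Copy + delete: the destination copy only becomes readable
        # after copy_delay, yet rename() returns right away.
        data, _ = self.objects.pop(src)
        self.objects[dst] = (data, time.monotonic() + self.copy_delay)

    def read(self, key):
        entry = self.objects.get(key)
        if entry is None:
            return None
        data, visible_at = entry
        return data if time.monotonic() >= visible_at else None

store = ToyObjectStore()
store.put(".distcp.tmp.attempt_x", b"payload")
store.rename(".distcp.tmp.attempt_x", "part-00000")
just_after = store.read("part-00000")  # rename has returned, but...
time.sleep(0.06)
later = store.read("part-00000")       # ...only readable after the delay
```

In this sketch `just_after` is `None` while `later` is the payload, which is
exactly the window during which a downstream consumer could miss the data.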

Is there any way to have DistCp write directly to the target location in
S3? If not, is there any reason why it would be undesirable to provide the
option of such behaviour?

The code in question is located here:
https://github.com/apache/hadoop/blob/branch-2.8/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java#L106-L136

Thanks,

Elliot.
