hadoop-mapreduce-issues mailing list archives

From "Sahil Takiar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6713) Distcp doesn't provide any option to override the default staging directory
Date Thu, 29 Sep 2016 16:07:20 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533190#comment-15533190 ]

Sahil Takiar commented on MAPREDUCE-6713:
-----------------------------------------

Hey [~kamrul], are you still working on this issue?

I'm interested in finding a way to avoid the .distcp.tmp file. When the target filesystem
is S3, writing to a .distcp.tmp file means renaming the tmp file to its final output name
afterwards. Since a rename on S3 requires re-writing the file, this can cause a big
performance hit for Distcp on S3.
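
To make the cost concrete, here is a toy model (not Distcp code and not the S3 API). It assumes a store with no native rename, so a rename is a copy followed by a delete, and it counts the bytes the store ends up writing. The class and function names, and the ".tmp" suffix, are made up for illustration.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store that has no native rename."""

    def __init__(self):
        self.objects = {}
        self.bytes_written = 0  # total bytes the store has had to write

    def put(self, key, data):
        self.objects[key] = data
        self.bytes_written += len(data)

    def rename(self, src, dst):
        # Assumption: rename is emulated as copy-to-new-key plus delete,
        # so the whole object is written a second time.
        self.put(dst, self.objects[src])
        del self.objects[src]


def copy_via_tmp(store, final_key, data):
    """Write to a temporary key first, then rename to the final key."""
    tmp_key = final_key + ".tmp"  # hypothetical temp name
    store.put(tmp_key, data)
    store.rename(tmp_key, final_key)


def copy_direct(store, final_key, data):
    """Write straight to the final key, with no rename step."""
    store.put(final_key, data)
```

In this model the tmp-then-rename path writes every byte twice, while the direct path writes it once. The trade-off is that a direct write gives up the protection the temp file provides against leaving a partially written file under the final name.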

> Distcp doesn't provide any option to override the default staging directory
> ---------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6713
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6713
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>    Affects Versions: 2.5.1
>            Reporter: Mohammad Kamrul Islam
>            Assignee: Mohammad Kamrul Islam
>
> *Current state and shortcoming*
> =======================
> By default, distcp writes temporary files into $TARGET_PATH/.distcp.tmp/$taskattemptid
> (see RetriableFileCopyCommand#getTmpFile). There is no way for a user to override this
> staging/tmp directory. The problem is obvious on S3 with versioning. For example, a user
> may want to turn on S3 versioning only for the target directory and not for the staging/tmp
> directory. Currently, distcp's staging directory is versioned as well, and it can contain
> a lot of temporary files. If the user could override this path with a non-versioned S3
> path for staging, it would make things cleaner.
>   
> *Proposed solution*
> ==============
> Provide a new option (-stage) that lets the user optionally specify a path on the target FS.
> Distcp mapper tasks will write distcp temporary files into that directory.
>
> *Possible Confusions*
> =================
> There is another distcp option (-tmp) that could be assumed to serve the same purpose,
> but it works only with the "-atomic" option, where temporary files have a different meaning.
> Another source of confusion could be the staging directory used by the MapReduce framework.
> The proposed temp directory is specific to distcp.
>
> Working on a patch to upload.
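
As a rough sketch of the proposal quoted above (not the actual patch, and not Distcp's Java code), the temp path could be resolved as below. The function name, the stage_dir parameter, and the choice to keep the ".distcp.tmp" component under the staging path are assumptions. The default layout follows the description in the issue.

```python
import posixpath


def resolve_tmp_path(target_path, task_attempt_id, stage_dir=None):
    """Return the temp file path for one task attempt.

    Without stage_dir, this mirrors the default layout described in the
    issue: $TARGET_PATH/.distcp.tmp/$taskattemptid.
    With stage_dir (the hypothetical -stage value), temp files are placed
    under that path instead of under the target path.
    """
    base = stage_dir if stage_dir is not None else target_path
    return posixpath.join(base, ".distcp.tmp", task_attempt_id)
```

For example, resolve_tmp_path("s3a://bucket/data", "attempt_01") stays under the target path, while passing stage_dir="s3a://bucket/staging" moves the temp file under the staging path and leaves the target directory free of temporary files.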



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org

