hadoop-mapreduce-issues mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-6840) Distcp to support cutoff time
Date Fri, 03 Feb 2017 01:27:51 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated MAPREDUCE-6840:
    Attachment: MAPREDUCE-6840.1.patch

> Distcp to support cutoff time
> -----------------------------
>                 Key: MAPREDUCE-6840
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6840
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distcp
>    Affects Versions: 2.6.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Minor
>         Attachments: MAPREDUCE-6840.1.patch
> To ensure consistency of datasets on HDFS, some projects, such as file formats on Hive,
> perform HDFS operations in a particular order. For example, if a file format uses an index
> file, a new version of the index file is written to HDFS only after all files referenced by
> the index have been written.
> When we run distcp, it is important to preserve that consistency so that we don't break
> those file formats.
> A typical solution is to create an HDFS snapshot beforehand and distcp only the snapshot.
> That works well if the user has the superuser privilege needed to make the directory snapshottable.
> If not, it would be beneficial to have a cutoff time for distcp, so that distcp copies only
> files modified on or before that cutoff time.
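
A minimal sketch of the cutoff idea described above, not the code in the attached patch. It assumes a write-once layout in which data files are written before the index that references them, so any index at or before the cutoff only points at files that also pass the filter. `FileEntry`, `select_for_copy`, the paths, and the timestamps are all hypothetical names and values chosen for illustration; no real distcp option or Hadoop API is shown.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for one entry of a source listing."""
    path: str
    mtime_ms: int  # modification time in milliseconds since the epoch


def select_for_copy(entries: Iterable[FileEntry], cutoff_ms: int) -> List[FileEntry]:
    """Keep only files modified on or before the cutoff time."""
    return [e for e in entries if e.mtime_ms <= cutoff_ms]


# Illustrative listing: each index version is written after the data
# files it references, mirroring the ordering described in the issue.
listing = [
    FileEntry("/data/part-0", 1000),
    FileEntry("/data/part-1", 1100),
    FileEntry("/data/index.v1", 1200),  # references part-0 and part-1
    FileEntry("/data/part-2", 1300),
    FileEntry("/data/index.v2", 1500),  # references part-0 .. part-2
]

# A cutoff of 1400 falls between part-2 and index.v2.
selected = select_for_copy(listing, cutoff_ms=1400)
selected_paths = [e.path for e in selected]
print(selected_paths)
```

With the cutoff at 1400, `index.v2` is skipped while `index.v1` and every file it references are kept, so the copied set stays self-consistent. Note that this reasoning depends on the write-once assumption: a file rewritten in place after the cutoff would be dropped entirely rather than copied in its older state.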

