hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-9820) Improve distcp to support efficient restore
Date Wed, 17 Feb 2016 15:33:18 GMT
Yongjun Zhang created HDFS-9820:
-----------------------------------

             Summary: Improve distcp to support efficient restore
                 Key: HDFS-9820
                 URL: https://issues.apache.org/jira/browse/HDFS-9820
             Project: Hadoop HDFS
          Issue Type: New Feature
          Components: distcp
            Reporter: Yongjun Zhang
            Assignee: Yongjun Zhang


HDFS-4167 aims to restore HDFS to the most recent snapshot, which involves some complexity
and challenges.

HDFS-7535 improved distcp performance by avoiding copying files whose names changed since the
last backup.

On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data from the source to
the target cluster by copying only the files changed since the last backup. It works by using the
snapshot diff to find all changed files and copying only those.

See https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/

This jira proposes a variation of HDFS-8828: find the files changed in the target
cluster since the last snapshot sx, and copy them from the source cluster's same snapshot sx,
to restore the target cluster to sx.

If a file/dir is

- renamed, rename it back
- created in target cluster, delete it
- modified, put it on the copy list

Then run distcp with the copy list, copying from the source cluster's corresponding snapshot.

This could be a new command line switch -rdiff in distcp.
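
The per-entry handling described above can be sketched as a small planning step. This is only
an illustration under stated assumptions, not distcp code: the names DiffEntry, RestorePlan and
plan_restore, and the "rename"/"create"/"modify" labels, are hypothetical and stand in for
whatever the snapshot diff report actually provides. Change types not covered by the list above
(for example deletions) are rejected rather than guessed at.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DiffEntry:
    """One change in the target cluster since snapshot sx (hypothetical shape)."""
    kind: str                       # "rename", "create" or "modify"
    path: str                       # path as it was in snapshot sx (or the new path for "create")
    new_path: Optional[str] = None  # current path, only set for renames


@dataclass
class RestorePlan:
    renames: List[Tuple[str, str]] = field(default_factory=list)  # (current path, original path)
    deletes: List[str] = field(default_factory=list)              # paths to remove from the target
    copies: List[str] = field(default_factory=list)               # paths to copy from the source's sx


def plan_restore(diff: List[DiffEntry]) -> RestorePlan:
    """Turn the target-side diff into the three actions listed in the proposal."""
    plan = RestorePlan()
    for entry in diff:
        if entry.kind == "rename":
            # Renamed since sx: rename it back to its original path.
            plan.renames.append((entry.new_path, entry.path))
        elif entry.kind == "create":
            # Created after sx: it does not exist in the snapshot, so delete it.
            plan.deletes.append(entry.path)
        elif entry.kind == "modify":
            # Modified since sx: copy the sx version from the source cluster.
            plan.copies.append(entry.path)
        else:
            # Not covered by the proposal text; fail loudly instead of guessing.
            raise ValueError(f"unhandled change type: {entry.kind}")
    return plan


# Example with made-up paths.
example_plan = plan_restore([
    DiffEntry("rename", "/data/a", "/data/a_renamed"),
    DiffEntry("create", "/data/tmp"),
    DiffEntry("modify", "/data/b"),
])
```

In this sketch only example_plan.copies would be handed to distcp; the renames and deletes
would be applied on the target cluster first.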

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
