hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9820) Improve distcp to support efficient restore to an earlier snapshot
Date Thu, 14 Apr 2016 23:47:25 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242150#comment-15242150 ]

Yongjun Zhang commented on HDFS-9820:

Hi [~jingzhao],

Thanks for proposing an offline discussion; I was thinking about the same :-) Just shared contact

Because of the similarity to HDFS-7535/HDFS-8828, the change can indeed be small (I have tried).
In the latest patch, some changes try to address the asymmetric output (HDFS-10263) by always
using the forward snapshot diff; other changes are intended to reorganize the code for better

For completeness' sake, I would appreciate it if you could respond to the comments I made in
my prior update.


> Improve distcp to support efficient restore to an earlier snapshot
> ------------------------------------------------------------------
>                 Key: HDFS-9820
>                 URL: https://issues.apache.org/jira/browse/HDFS-9820
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: distcp
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-9820.001.patch, HDFS-9820.002.patch, HDFS-9820.003.patch, HDFS-9820.004.patch
> HDFS-4167 intends to restore HDFS to the most recent snapshot, and it involves some complexities
and challenges.
> HDFS-7535 improved distcp performance by avoiding copying files that were renamed since
the last backup.
> On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data from the source
to the target cluster by copying only the files changed since the last backup. It works by using
snapshot diff to find all changed files and copying only those.
> See https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
> This jira proposes a variation of HDFS-8828: find the files changed in the target
cluster since the last snapshot sx, and copy them from the same snapshot sx on the source cluster,
to restore the target cluster to sx.
> If a file/dir in the target cluster is
> - renamed, rename it back
> - created in target cluster, delete it
> - modified, put it to the copy list
> Then run distcp with the copy list, copying from the source cluster's corresponding snapshot.
> This could be a new command line switch -rdiff in distcp.
> HDFS-4167 would still be nice to have. It just seems to me that HDFS-9820 would hopefully
be easier to implement.
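
The restore steps in the quoted description can be sketched roughly as below. This is only
an illustration under stated assumptions, not the code in the attached patches: the DiffEntry
and RestorePlan types and the plan_restore function are hypothetical names, and the diff is
modeled as a plain list rather than an actual HDFS snapshot diff report. Only the three cases
named in the description (renamed, created, modified) are handled.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DiffEntry:
    """One hypothetical change on the target cluster since snapshot sx."""
    kind: str           # "RENAME", "CREATE", or "MODIFY"
    path: str           # path as of sx (RENAME, MODIFY) or current path (CREATE)
    new_path: str = ""  # current path after the rename (RENAME only)


@dataclass
class RestorePlan:
    renames: List[Tuple[str, str]] = field(default_factory=list)  # (current, original)
    deletes: List[str] = field(default_factory=list)
    copy_list: List[str] = field(default_factory=list)


def plan_restore(diff: List[DiffEntry]) -> RestorePlan:
    """Turn the target cluster's changes since sx into restore actions."""
    plan = RestorePlan()
    for entry in diff:
        if entry.kind == "RENAME":
            # Renamed on the target: rename it back to its name in sx.
            plan.renames.append((entry.new_path, entry.path))
        elif entry.kind == "CREATE":
            # Created on the target after sx: delete it.
            plan.deletes.append(entry.path)
        elif entry.kind == "MODIFY":
            # Modified on the target: recopy it from the source's snapshot sx.
            plan.copy_list.append(entry.path)
        else:
            # Cases not named in the description are out of scope for this sketch.
            raise ValueError(f"unhandled diff entry kind: {entry.kind}")
    return plan
```

After the renames and deletes are applied on the target, the copy list would be handed to
distcp, reading from the source cluster's snapshot sx.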

This message was sent by Atlassian JIRA
