hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9820) Improve distcp to support efficient restore to an earlier snapshot
Date Thu, 14 Apr 2016 20:41:25 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241868#comment-15241868 ]

Yongjun Zhang commented on HDFS-9820:

Thanks a lot [~jingzhao]!

My thoughts to share:

Let's say we first have snapshot s1 on both source and target (and the source and the target
have been synced). Then we make some changes on the source, do a forward incremental distcp
copy to apply the changes to the target. Based on our assumption, before the next incremental
copy, we will create a snapshot s2 on both the source and the target.
This is HDFS-7535/HDFS-8828. One small correction: before we do the incremental copy, we first
create snapshot s2 on the source cluster, find the snapshot diff between s1 and s2, apply this
diff to the target cluster, and finally create s2 on the target cluster. In this case we assume
that no changes have been made on the target cluster after s1 was created and before we do the
incremental copy (*assumption I*).
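
For illustration only, the forward incremental sequence described above can be sketched as a list of commands. This is a minimal sketch, not the distcp implementation: the function name and the example paths are hypothetical, and it only builds the command lines ({{hdfs dfs -createSnapshot}} and {{hadoop distcp -update -diff}}) without running them.

```python
def forward_sync_commands(src, tgt, prev_snap, new_snap):
    """Build the commands for one forward incremental sync (hypothetical helper).

    Assumes prev_snap already exists on both sides and that the target has
    not changed since prev_snap was taken (assumption I).
    """
    return [
        # 1. Snapshot the source so the diff has a fixed end point.
        ["hdfs", "dfs", "-createSnapshot", src, new_snap],
        # 2. Apply the prev_snap -> new_snap changes from source to target.
        ["hadoop", "distcp", "-update", "-diff", prev_snap, new_snap, src, tgt],
        # 3. Snapshot the target so both sides share new_snap afterwards.
        ["hdfs", "dfs", "-createSnapshot", tgt, new_snap],
    ]


if __name__ == "__main__":
    # Example paths are made up for the sketch.
    for cmd in forward_sync_commands("hdfs://src/data", "hdfs://tgt/data", "s1", "s2"):
        print(" ".join(cmd))
```

The snapshot on the target is taken last, so that s2 on the target captures the state after the diff has been applied.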

2. Do you mean that if {{""}} ever appears as a parameter of {{-diff}}, then it's a revert
operation, otherwise it's a forward operation?
In theory, we could copy incremental changes from the source cluster to the destination cluster
without creating a new snapshot (s2 in our example). Say, after s1 is made in the source cluster,
s1 is synced to the target cluster, and s1 is also created in the target cluster, we could interpret

{{distcp -diff s1 "" source target}}

as incrementally copying the changes made after s1 in the source cluster to the target, right?
Because {{""}} is just an alias of the current state "snapshot".

I personally feel it's more intuitive to rely on the parameter order, and let {{-diff s1 s2}}
mean the forward change from s1 to s2 and {{-diff s2 s1}} mean the revert change from s2
to s1. Say a cluster is already at state s2: if we do {{-diff s1 s2}}, it would be
a no-op; if we do {{-diff s2 s1}}, it means to go back to s1. In other words, {{-diff <fromState>
<toState>}} is what I feel is more intuitive.
But if this is what you prefer, we can relax the order requirement and let {{""}} mean a revert
operation. Would you please confirm?
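
To make the order-based reading concrete, here is a small sketch of how {{-diff <fromState> <toState>}} could be interpreted against the state the target is currently in. It only models the semantics proposed in this comment; the function and its return values are hypothetical and are not part of distcp.

```python
def interpret_diff(target_state, from_state, to_state):
    """Decide what -diff <fromState> <toState> would mean (hypothetical model).

    target_state is the state the target cluster is currently in.
    """
    if target_state == to_state:
        # The target is already where the command wants it to be.
        return "no-op"
    if target_state == from_state:
        # Move the target from from_state to to_state, in whichever direction.
        return "apply {} -> {}".format(from_state, to_state)
    raise ValueError("target is at neither fromState nor toState")


if __name__ == "__main__":
    print(interpret_diff("s2", "s1", "s2"))
    print(interpret_diff("s2", "s2", "s1"))
```

With the target at s2, the first call models {{-diff s1 s2}} and the second models {{-diff s2 s1}}.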

And would you please let me know whether my comment #1 in my previous reply makes sense to

3. I don't quite follow what you meant by "bypass DistCpSync#prepareDiffList". Some more details
would help.

Many thanks.

> Improve distcp to support efficient restore to an earlier snapshot
> ------------------------------------------------------------------
>                 Key: HDFS-9820
>                 URL: https://issues.apache.org/jira/browse/HDFS-9820
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: distcp
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-9820.001.patch, HDFS-9820.002.patch, HDFS-9820.003.patch, HDFS-9820.004.patch
> HDFS-4167 intends to restore HDFS to the most recent snapshot, and there are some complexities
> and challenges.
> HDFS-7535 improved distcp performance by avoiding copying files whose names changed since
> the last backup.
> On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data from the source
> to the target cluster, by copying only the files changed since the last backup. It works by using
> the snapshot diff to find all changed files and copying only those.
> See https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
> This jira proposes a variation of HDFS-8828: find the files changed in the target
> cluster since the last snapshot sx, and copy these from the source cluster's same snapshot sx,
> to restore the target cluster to sx.
> If a file/dir is
> - renamed, rename it back
> - created in target cluster, delete it
> - modified, put it to the copy list
> Then run distcp with the copy list, copying from the source cluster's corresponding snapshot.
> This could be a new command line switch -rdiff in distcp.
> HDFS-4167 would still be nice to have. It just seems to me that HDFS-9820 would hopefully
> be easier to implement.
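
The restore steps listed in the issue description above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patch attached to this jira: the {{DiffEntry}} type, its {{kind}} values and {{build_restore_plan}} are hypothetical names, and only the three cases named in the description are handled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DiffEntry:
    """One change on the target since snapshot sx (hypothetical representation)."""
    kind: str                       # "renamed", "created", or "modified"
    path: str                       # path as of snapshot sx (or the new path if created)
    new_path: Optional[str] = None  # current path, only set for "renamed"


def build_restore_plan(
    entries: List[DiffEntry],
) -> Tuple[List[Tuple[str, str]], List[str], List[str]]:
    """Split the target-side diff into rename-backs, deletes, and a copy list."""
    rename_back, delete, copy_list = [], [], []
    for e in entries:
        if e.kind == "renamed":
            # Undo the rename: move the current path back to its old name.
            rename_back.append((e.new_path, e.path))
        elif e.kind == "created":
            # Did not exist at sx, so remove it from the target.
            delete.append(e.path)
        elif e.kind == "modified":
            # Re-copy from the source cluster's corresponding snapshot.
            copy_list.append(e.path)
        else:
            raise ValueError("case not covered by the description: " + e.kind)
    return rename_back, delete, copy_list


if __name__ == "__main__":
    plan = build_restore_plan([
        DiffEntry("renamed", "/d/a", "/d/a2"),
        DiffEntry("created", "/d/tmp"),
        DiffEntry("modified", "/d/b"),
    ])
    print(plan)
```

The copy list returned last is what would be handed to distcp; any other kind of change is rejected here rather than guessed at.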
