hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-10314) Propose a new tool that wraps around distcp to "restore" changes on target cluster
Date Tue, 19 Apr 2016 22:32:25 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248778#comment-15248778
] 

Yongjun Zhang edited comment on HDFS-10314 at 4/19/16 10:31 PM:
----------------------------------------------------------------

The idea is to wrap distcp in a tool that achieves the functionality of distcp's -rdiff
switch (if we do the same for -diff, it will be a different jira). Here is a description
and comparison of the -diff switch and the unimplemented -rdiff switch.

{code}
Definition: Assume we have two snapshots, s1 and s2, where s1 is created earlier and s2
is newer.

- snapshotDiff(s1, s2) represents the delta between s1 and s2. That is, if we apply
  snapshotDiff(s1, s2) on top of s1, we can go to the state of s2.
- snapshotDiff(s2, s1) represents the reversed delta between s1 and s2. That is, if
  we apply snapshotDiff(s2, s1) on top of s2, we can go back to the state of s1.

Note: When we talk about source and target, we mean distcp source and distcp target.

A. -diff allows distcp to efficiently copy incremental changes made in the source cluster
    (on top of the previously copied snapshot s1) to the target cluster. Assuming snapshot
    s2 is created at the source to capture s1 + incremental changes, snapshotDiff(s1, s2)
    is the incremental changes. The output of this operation is that the target will be
    at s2 state. This operation involves three steps:

  A.1 calculate snapshotDiff(s1, s2) at the source
  A.2 apply the rename and delete portion of the snapshotDiff at the target. This step is
        called "sync"
  A.3 copy created/modified files from source's s2 to target

B. -rdiff allows distcp to efficiently copy data from snapshot s1 to overwrite changes made
    in the target after snapshot s1 was created in the target. Assuming snapshot s2 is
    created at the target to capture the changes that need to be overwritten,
    snapshotDiff(s2, s1) is what we want to apply to the target. The output of this
    operation is that the target is at s1 state. Like -diff, though with some differences,
    this operation involves three steps:

  B.1 calculate snapshotDiff(s2, s1) at the target
  B.2 apply the rename and delete portion of the snapshotDiff at the target. This step is
        called "sync"
  B.3 copy created/modified files from source's s1 to target. (The source here can be a
        different cluster, or the target itself. When it's a different cluster, that cluster
        has to have a snapshot s1 with exactly the same name and content as the s1 at the
        target.)

A tabularized comparison:

                  required snapshots         DiffCalc    Output After Operation
                  source        target
                  ---------------------------------------------------------------
-diff             s1, s2   ->   s1           source      target is at s2
-rdiff            s1       ->   s1, s2       target      target is at s1

(note, for -rdiff, the source could be the same as target)

So the "r" (reversed) in -rdiff means the following, which is symmetric to -diff:

- swap the snapshot requirement of source and target in -diff 
  (from "s1, s2   ->   s1 "  to  "s1  ->   s1,s2")
- swap the result snapshot after operation (from s2 to s1)
- swap the snapshot diff calculation place  (from source to target)

We require source and target to have the same snapshot s1 (same snapshot name, same content).
{code}
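
To make the symmetry concrete, here is a minimal sketch of the three steps in a toy model. It is not distcp code: a snapshot is assumed to be a plain dict mapping path to file content, renames are not modeled separately (they show up as a delete plus a create), and the helper names `snapshot_diff` and `apply_diff` are made up for illustration.

```python
# Toy model: a snapshot is a dict mapping path -> file content.
# Assumption: renames are not tracked; they appear as delete + create.

def snapshot_diff(frm, to):
    """Return (deleted_paths, changed_paths) that turn `frm` into `to`."""
    deleted = sorted(p for p in frm if p not in to)
    changed = sorted(p for p in to if p not in frm or frm[p] != to[p])
    return deleted, changed


def apply_diff(state, diff, copy_from):
    """Step 2 ("sync"): drop deleted paths. Step 3: copy created/modified files."""
    deleted, changed = diff
    result = {p: c for p, c in state.items() if p not in deleted}
    for p in changed:
        result[p] = copy_from[p]
    return result


s1 = {"/a": "v1", "/b": "v1"}
s2 = {"/a": "v2", "/c": "v1"}

# -diff: diff is calculated at the source; the target moves from s1 to s2.
after_diff = apply_diff(s1, snapshot_diff(s1, s2), copy_from=s2)

# -rdiff: diff is calculated at the target; the target moves from s2 back to s1.
after_rdiff = apply_diff(s2, snapshot_diff(s2, s1), copy_from=s1)

print(after_diff == s2, after_rdiff == s1)
```

In this model the only things that change between the two directions are the argument order of the diff and which snapshot the files are copied from, which is the swap summarized in the table.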




> Propose a new tool that wraps around distcp to "restore" changes on target cluster
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-10314
>                 URL: https://issues.apache.org/jira/browse/HDFS-10314
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>
> HDFS-9820 proposed adding an -rdiff switch to distcp, as the reverse operation of the -diff switch.
> Upon discussion with [~jingzhao], we will introduce a new tool that wraps around distcp to achieve the same purpose.
> I'm thinking about calling the new tool "rsync", similar to the unix/linux command "rsync". The "r" here means remote.
> The syntax that simulates the -rdiff behavior proposed in HDFS-9820 is
> {code}
> rsync <fromSnapshotName>  <toSnapshotName>  <source> <target>
> {code}
> This command ensures <fromSnapshotName>  is newer than <toSnapshotName>.
> In the future, we can add another command that has the functionality of distcp's -diff switch.
> {code}
> sync <fromSnapshotName>  <toSnapshotName>  <source> <target>
> {code}
> that ensures <fromSnapshotName>  is older than <toSnapshotName>.
> Thanks [~jingzhao].
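
The ordering checks described for the two proposed commands could be sketched roughly as below. Everything in it is an assumption for illustration: `check_snapshot_order` is a hypothetical helper, snapshot creation times are taken to be plain comparable numbers, and neither the function nor the error messages come from distcp or from the proposed tool.

```python
def check_snapshot_order(command, from_ctime, to_ctime):
    """Validate snapshot ordering for the proposed commands (hypothetical helper).

    rsync: <fromSnapshotName> must be newer than <toSnapshotName>.
    sync:  <fromSnapshotName> must be older than <toSnapshotName>.
    """
    if command == "rsync":
        ok = from_ctime > to_ctime
    elif command == "sync":
        ok = from_ctime < to_ctime
    else:
        raise ValueError(f"unknown command: {command}")
    if not ok:
        raise ValueError(f"{command}: snapshots are in the wrong order")
    return True


# rsync s2 s1 <source> <target>: from-snapshot s2 (time 2) is newer than s1 (time 1).
check_snapshot_order("rsync", from_ctime=2, to_ctime=1)
# sync s1 s2 <source> <target>: from-snapshot s1 (time 1) is older than s2 (time 2).
check_snapshot_order("sync", from_ctime=1, to_ctime=2)
```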



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
