hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7535) Utilize Snapshot diff report for distcp
Date Tue, 16 Dec 2014 20:44:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248896#comment-14248896 ]

Jing Zhao commented on HDFS-7535:
---------------------------------

A typical scenario for using snapshots with distcp looks like this: every time we start distcp
between the primary cluster and the backup cluster, we first create a snapshot in the primary
cluster. We then compute the snapshot diff report between this latest snapshot and the snapshot
created for the previous distcp run. This diff report represents the delta that should be
applied to the backup cluster. For changes like deletion and rename, we can directly apply
the same operations (in a specific order based on their dependencies) in the backup
cluster. For changes like creation, append, and other metadata modifications, we keep using
the functionality of the current distcp. With this approach we avoid unnecessary data copy
and also guarantee that the source data is immutable, since the snapshot is read-only.
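
As a rough illustration of the split described above, the sketch below sorts diff report
entries into operations that could be replayed directly on the backup cluster (deletion,
rename) and entries that would still go through the regular distcp copy (creation,
modification). It assumes the textual report format printed by `hdfs snapshotDiff`, where
each line starts with `+`, `-`, `M` or `R` and a rename is written as `old -> new`. The
names `DiffEntry`, `parse_diff_line` and `plan_sync` are hypothetical and are not part of
distcp, and the dependency-based ordering of deletes and renames is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DiffEntry:
    """One line of a snapshot diff report (hypothetical helper type)."""
    kind: str                      # "+", "-", "M" or "R"
    path: str                      # path relative to the snapshottable directory
    target: Optional[str] = None   # new path, only set for renames


def parse_diff_line(line: str) -> DiffEntry:
    """Parse a line such as 'R\t./a -> ./b' (assumed report format)."""
    kind, rest = line.split(None, 1)
    if kind == "R":
        old, new = rest.split(" -> ", 1)
        return DiffEntry(kind, old.strip(), new.strip())
    return DiffEntry(kind, rest.strip())


def plan_sync(lines: List[str]) -> Tuple[List[DiffEntry], List[DiffEntry]]:
    """Split entries into (apply directly on target, hand to regular distcp)."""
    direct: List[DiffEntry] = []
    copy: List[DiffEntry] = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry = parse_diff_line(line)
        if entry.kind in ("-", "R"):
            # Deletions and renames need no data transfer.
            direct.append(entry)
        elif entry.kind in ("+", "M"):
            # Creations and modifications still need the normal copy path.
            copy.append(entry)
        else:
            raise ValueError(f"unknown diff entry type: {entry.kind!r}")
    return direct, copy
```

For example, `plan_sync(["+\t./new.txt", "R\t./a -> ./b"])` puts the rename in the first
list and the created file in the second. A real implementation would also have to order
the first list so that dependent deletes and renames do not conflict.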

We plan to use this jira to provide the basic functionality for the above approach. More
specifically, we can first add extra options to the current distcp tool so that it can compute
the delta based on the diff report between two given snapshots. Managing snapshots in
the source/target clusters can be handled in separate jiras or through separate tools.

> Utilize Snapshot diff report for distcp
> ---------------------------------------
>
>                 Key: HDFS-7535
>                 URL: https://issues.apache.org/jira/browse/HDFS-7535
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>
> Currently the HDFS snapshot diff report can identify file/directory creation, deletion, rename
> and modification under a snapshottable directory. We can use the diff report for distcp between
> the primary cluster and a backup cluster to avoid unnecessary data copy. This is especially
> useful when a big directory is renamed in the primary cluster: the current
> distcp cannot detect the rename op, so the rename usually leads to a large amount of real
> data copy.
> More details of the approach will come in the first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
