hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-7535) Utilize Snapshot diff report for distcp
Date Tue, 16 Dec 2014 20:47:14 GMT

     [ https://issues.apache.org/jira/browse/HDFS-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-7535:
    Attachment: HDFS-7535.000.patch

Uploaded a very early patch to demo the functionality. The patch adds new options to the current
distcp tool to specify the from/to snapshot names (assuming the same snapshot names are used
in the primary and backup clusters). Before doing the real data copy, the patch identifies
the delete/rename operations from the diff report and applies the same operations on the backup
cluster, so that we avoid the unnecessary data copy caused by renames.
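
To make the pre-copy step concrete, here is a minimal sketch of how diff entries might be split
into operations to replay on the backup cluster and paths that still need a real copy. It is
not the code in the attached patch: the class, the Entry and Plan types, and the string form of
the operations are all hypothetical stand-ins, and no Hadoop API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; none of these names come from the patch or from Hadoop.
final class SnapshotDiffSyncSketch {
    // The four change kinds the issue says the snapshot diff report can identify.
    enum DiffType { CREATE, DELETE, RENAME, MODIFY }

    // Stand-in for one diff report entry; target is only meaningful for RENAME.
    static final class Entry {
        final DiffType type;
        final String source;
        final String target;

        Entry(DiffType type, String source, String target) {
            this.type = type;
            this.source = source;
            this.target = target;
        }
    }

    // Result of planning: metadata ops to replay first, then paths to copy.
    static final class Plan {
        final List<String> preCopyOps = new ArrayList<>();
        final List<String> pathsToCopy = new ArrayList<>();
    }

    // Deletes and renames become cheap metadata ops on the backup cluster;
    // creations and modifications are left for the real data copy.
    static Plan plan(List<Entry> diff) {
        Plan plan = new Plan();
        for (Entry e : diff) {
            switch (e.type) {
                case DELETE:
                    plan.preCopyOps.add("delete " + e.source);
                    break;
                case RENAME:
                    plan.preCopyOps.add("rename " + e.source + " -> " + e.target);
                    break;
                default:
                    plan.pathsToCopy.add(e.source);
                    break;
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        // Example paths are made up for illustration.
        List<Entry> diff = Arrays.asList(
                new Entry(DiffType.RENAME, "/data/big", "/data/big-renamed"),
                new Entry(DiffType.DELETE, "/data/old", null),
                new Entry(DiffType.CREATE, "/data/new-file", null));
        Plan plan = plan(diff);
        System.out.println("Replay on backup: " + plan.preCopyOps);
        System.out.println("Still to copy:    " + plan.pathsToCopy);
    }
}
```

The point of the split is that a rename of a large directory ends up as a single metadata
operation on the backup cluster instead of a full re-copy. Ordering between renames and deletes,
and how modified directories are treated, are among the corner cases this sketch ignores.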

More work is still needed to make the whole process more efficient. Tests and corner-case
handling also need to be added.

> Utilize Snapshot diff report for distcp
> ---------------------------------------
>                 Key: HDFS-7535
>                 URL: https://issues.apache.org/jira/browse/HDFS-7535
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>         Attachments: HDFS-7535.000.patch
> Currently the HDFS snapshot diff report can identify file/directory creation, deletion, rename
> and modification under a snapshottable directory. We can use the diff report for distcp between
> the primary cluster and a backup cluster to avoid unnecessary data copy. This is especially
> useful when a big directory is renamed in the primary cluster: the current distcp cannot
> detect the rename op, so the rename usually leads to a large amount of real data copy.
> More details of the approach will come in the first comment.

This message was sent by Atlassian JIRA