hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-7535) Utilize Snapshot diff report for distcp
Date Thu, 26 Feb 2015 21:28:05 GMT

     [ https://issues.apache.org/jira/browse/HDFS-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-7535:
----------------------------
    Attachment: HDFS-7535.003.patch

Thanks for the review, Nicholas! I've updated the patch to address your comments.

bq. In DistCpSync.moveToTmpDir, why move the paths to tmp for the delete operations?

I'm thinking about whether we can support "undo" for this functionality in the future. That is,
if the user hits an issue while applying the diff and we have moved all the files/dirs to the tmp
dir instead of deleting them, we still have a chance to undo all the changes.
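
To illustrate the "move to tmp, then undo or commit" idea, here is a minimal sketch on the local filesystem using java.nio. It is not the code in {{DistCpSync.moveToTmpDir}}; the class {{StagedDelete}} and its methods are hypothetical names made up for this example, and it assumes the tmp dir is on the same filesystem as the paths being staged.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not the DistCpSync implementation.
class StagedDelete {
    private final Path tmpDir;
    // Each entry is {original location, staged location}; most recent move first.
    private final Deque<Path[]> moves = new ArrayDeque<>();

    StagedDelete(Path tmpDir) {
        this.tmpDir = tmpDir;
    }

    // "Delete" a path by renaming it into the tmp dir so it can be restored later.
    void stage(Path target) throws IOException {
        Path staged = tmpDir.resolve(moves.size() + "-" + target.getFileName());
        Files.move(target, staged);
        moves.push(new Path[] {target, staged});
    }

    // Undo: move everything back to where it came from, most recent first.
    void undo() throws IOException {
        while (!moves.isEmpty()) {
            Path[] move = moves.pop();
            Files.move(move[1], move[0]);
        }
    }

    // Commit: the sync succeeded, so the staged paths can really be removed.
    // This sketch only handles files and empty directories.
    void commit() throws IOException {
        while (!moves.isEmpty()) {
            Files.delete(moves.pop()[1]);
        }
    }
}
```

A caller would {{stage}} each path that the diff marks as deleted, apply the rest of the diff, and then call {{commit}} on success or {{undo}} on failure.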

bq. Would it be able to preserve other attributes for the "-p" option?

Attribute preservation is handled later in the CopyMapper, which calls {{DistCpUtils#preserve}}.
I will run some system tests and maybe add a new unit test to verify this.

bq. Is it better to throw an exception instead since the user may not want to fallback?

My current concern is that if this functionality is used by applications like Falcon and Oozie,
it may be more convenient to include the fallback logic inside distcp itself. If we
throw exceptions directly, these applications need to be able to change the
options themselves to avoid using snapshot diff.
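
The trade-off between falling back and throwing can be sketched as follows. This is only an illustration of the two behaviors being discussed, not code from the patch; {{SyncWithFallback}}, {{Mode}}, and the {{allowFallback}} flag are hypothetical names, and the sync and copy steps are passed in as plain callbacks.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration; none of these names are taken from the patch.
class SyncWithFallback {
    enum Mode { DIFF_SYNC, FULL_COPY }

    // Try the snapshot-diff based sync first. If it reports failure or throws,
    // either fall back to a regular full copy or surface the failure,
    // depending on allowFallback.
    static Mode run(Callable<Boolean> diffSync, Runnable fullCopy, boolean allowFallback) {
        boolean synced;
        Exception cause = null;
        try {
            synced = diffSync.call();
        } catch (Exception e) {
            synced = false;
            cause = e;
        }
        if (synced) {
            return Mode.DIFF_SYNC;
        }
        if (!allowFallback) {
            throw new IllegalStateException("snapshot diff sync could not be applied", cause);
        }
        fullCopy.run();
        return Mode.FULL_COPY;
    }
}
```

With {{allowFallback}} set to true, a caller such as a workflow scheduler would not need to change its options when the diff cannot be applied; with false, it sees the failure and has to decide for itself.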

> Utilize Snapshot diff report for distcp
> ---------------------------------------
>
>                 Key: HDFS-7535
>                 URL: https://issues.apache.org/jira/browse/HDFS-7535
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp, snapshots
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>         Attachments: HDFS-7535.000.patch, HDFS-7535.001.patch, HDFS-7535.002.patch, HDFS-7535.003.patch
>
>
> Currently the HDFS snapshot diff report can identify file/directory creation, deletion, rename
> and modification under a snapshottable directory. We can use the diff report for distcp between
> the primary cluster and a backup cluster to avoid unnecessary data copying. This is especially
> useful when a big directory is renamed in the primary cluster: the current
> distcp cannot detect the rename op, so the rename usually leads to a large amount of real
> data being copied.
> More details of the approach will come in the first comment.
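
As a rough illustration of why the diff report helps, the sketch below splits a list of diff entries into metadata-only operations (rename, delete) that can be replayed on the backup cluster and paths that actually need their data copied. It is a simplified model, not the approach from the patch: {{DiffPlanner}}, {{Entry}}, and {{Plan}} are hypothetical types, the entry kinds simply mirror the four change types named in the description, and real handling (ordering of renames and deletes, modified directories) is more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a snapshot diff; not an HDFS API.
class DiffPlanner {
    enum Type { CREATE, DELETE, RENAME, MODIFY }

    static final class Entry {
        final Type type;
        final String source;
        final String target; // only meaningful for RENAME, otherwise null

        Entry(Type type, String source, String target) {
            this.type = type;
            this.source = source;
            this.target = target;
        }
    }

    static final class Plan {
        final List<String> metadataOps = new ArrayList<>();
        final List<String> pathsToCopy = new ArrayList<>();
    }

    // Renames and deletes become cheap metadata operations on the target;
    // only created or modified paths are queued for a real data copy.
    static Plan plan(List<Entry> diff) {
        Plan plan = new Plan();
        for (Entry e : diff) {
            switch (e.type) {
                case RENAME:
                    plan.metadataOps.add("rename " + e.source + " -> " + e.target);
                    break;
                case DELETE:
                    plan.metadataOps.add("delete " + e.source);
                    break;
                case CREATE:
                case MODIFY:
                    plan.pathsToCopy.add(e.source);
                    break;
            }
        }
        return plan;
    }
}
```

In this model, renaming a large directory contributes a single rename operation and nothing to the copy list, whereas a copy that cannot see the rename would treat the new location as entirely new data.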



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
