hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10314) A new tool to sync current HDFS view to specified snapshot
Date Tue, 20 Sep 2016 23:50:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508167#comment-15508167 ]

Yongjun Zhang commented on HDFS-10314:

Many thanks [~jingzhao] for the review and feedback. Please see my answers below.

> So the current patch actually adds a new distsync extension and implements the "calculating
> diff on target cluster" approach?

Yes. This is the result of our discussion in HDFS-9820.

Although I preferred adding -rdiff as the symmetric counterpart of distcp's -diff, as
reported in HDFS-9820, I think your suggestion of creating a new tool is fine, as long as
we leverage the code that implements -diff in distcp and minimize code duplication.

> I think to have the diff calculated on target is fine,

Yes. Since the goal is to bring the target's state back to a specified snapshot, we'd better
calculate the snapshot diff at the target.

> but I'm not sure to directly extend the current distcp is a good idea.

There are a couple of reasons why I came up with the idea of extending distcp:
* distsync is a customized distcp; it extends distcp's -diff behavior to support -rdiff.
* It's better to re-use the code that implements -diff, and extending distcp allows re-using
the existing implementation of "-diff". You can see that DistSync.java in my patch rev001 is
only 124 lines of code (including the header and imports).
> Correct me if I'm wrong. Here's my current understanding of the patch:
> 1. our main motivation is still to utilize distcp to restore a snapshot
> 2. the idea is to compute the delta on the target cluster, and for modified files we get their
> original state from the source.

Yes. However, for modified files, I intended to make it flexible enough to copy from the
specified snapshot of either the source or the target.

> In that sense, I think a simpler way is to wrap (but not extend) the current distcp in the
> snapshot-restore tool:
> 1. The tool takes a single cluster and a target snapshot as arguments
> 2. The tool computes the delta for restoring using the snapshot diff report
> 3. The tool does rename/delete etc. metadata ops to revert part of the diff
> 4. The tool uses distcp (by invoking distcp as a library) to copy the original states of
> modified files
> In this way we can minimize the change (no need to touch the current distcp implementation/arguments)
> and provide a new tool with simple but clear semantics. We may lose some flexibility (only
> handling one cluster) but the tool itself will be easy to use and will not cause any confusion
> to the end users.
> What do you think? Please let me know if I missed anything.
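
As a rough illustration of steps 2 to 4 of the quoted proposal, here is a minimal sketch of how a snapshot diff could be turned into a revert plan. Everything in it is hypothetical: DiffType, DiffEntry, and Plan are plain stand-ins written for this example, not the Hadoop snapshot diff or distcp classes. It also assumes that deleted files are restored by copying them from the snapshot, and it ignores the ordering problems that interacting renames and deletes would raise in a real tool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types are Hadoop classes.
class RevertPlanSketch {

    // Stand-in for the kinds of entries a snapshot diff report can contain.
    enum DiffType { CREATE, DELETE, RENAME, MODIFY }

    // One change made after the snapshot was taken.
    static final class DiffEntry {
        final DiffType type;
        final String path;     // path relative to the snapshot root
        final String newPath;  // rename destination; null for other types

        DiffEntry(DiffType type, String path, String newPath) {
            this.type = type;
            this.path = path;
            this.newPath = newPath;
        }
    }

    // Result of planning: metadata-only operations plus paths that need a copy.
    static final class Plan {
        final List<String> metadataOps = new ArrayList<>();
        final List<String> pathsToCopy = new ArrayList<>();
    }

    // Inverts each diff entry so that applying the plan leads back to the snapshot.
    static Plan plan(List<DiffEntry> diff) {
        Plan plan = new Plan();
        for (DiffEntry entry : diff) {
            switch (entry.type) {
                case CREATE:
                    // Created after the snapshot: remove it.
                    plan.metadataOps.add("delete " + entry.path);
                    break;
                case RENAME:
                    // Renamed after the snapshot: rename it back.
                    plan.metadataOps.add("rename " + entry.newPath + " " + entry.path);
                    break;
                case DELETE:
                case MODIFY:
                    // Original content must be copied back (assumed to be
                    // the part a distcp invocation would handle).
                    plan.pathsToCopy.add(entry.path);
                    break;
            }
        }
        return plan;
    }
}
```

In this sketch, only the entries in pathsToCopy would be handed to a copy job; the metadataOps entries would be applied directly on the target.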

We discussed two overall solutions earlier.

* Solution A. What was proposed in HDFS-9820: adding "-rdiff s2 s1" to distcp, to achieve
behavior symmetric to distcp's "-diff s1 s2".
* Solution B. What was proposed in HDFS-10314: introducing a new tool that allows syncing a
target cluster to a specified snapshot.

For Solution B, there are two approaches: one (B.1) is my patch rev001 here; the other (B.2)
is what you proposed above.

Some thoughts:

# Creating a new tool is itself going to mean extra support. That's why I preferred solution A,
which is the simplest.
# Given that we want to create a new tool, we'd better maximize code sharing; otherwise it's
going to mean both more development effort and extra support effort.
# To me, the approach suggested in solution B.2 prevents sharing the existing implementation
of -diff in distcp. Thus I think it's actually not simpler, and it would incur a future support
burden because of the duplicated code.
# I think we agreed in our discussion that, if we create a new tool, you don't have a strong
opinion on whether we copy from a different cluster or from the same target cluster. As I shared
earlier, I can tell from the user's case that copying from a different mirror cluster can
sometimes be much faster. So I keep suggesting that it would be better to support the flexibility
to copy from either the source or the target.
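
To make the last point concrete, here is a small sketch of how the copy source for a modified file could be chosen. It relies only on the HDFS convention that a snapshot is readable under <snapshot root>/.snapshot/<snapshotName>/; the class, the method names, and the boolean switch are hypothetical and are not part of distcp or of my patch.

```java
// Hypothetical sketch: picks where the original state of a file is read from.
class SnapshotPathSketch {

    // Builds <root>/.snapshot/<snapshotName>/<relativePath>.
    static String snapshotPath(String root, String snapshotName, String relativePath) {
        // Normalize slashes so the pieces join with exactly one separator.
        String base = root.endsWith("/") ? root.substring(0, root.length() - 1) : root;
        String rel = relativePath.startsWith("/") ? relativePath.substring(1) : relativePath;
        return base + "/.snapshot/" + snapshotName + (rel.isEmpty() ? "" : "/" + rel);
    }

    // Chooses the mirror (source) cluster or the target cluster itself.
    static String copySource(boolean fromMirror, String mirrorRoot, String targetRoot,
                             String snapshotName, String relativePath) {
        String root = fromMirror ? mirrorRoot : targetRoot;
        return snapshotPath(root, snapshotName, relativePath);
    }
}
```

With made-up roots hdfs://mirror/data and hdfs://target/data and snapshot s1, the file a/b.txt would be read from hdfs://mirror/data/.snapshot/s1/a/b.txt when copying from the mirror, and from hdfs://target/data/.snapshot/s1/a/b.txt otherwise.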

Would you please share the specific problems you see with solution B.1?

Honestly speaking, I still prefer solution A. But I'm OK with solution B, provided that we
can share the -diff code already implemented in distcp.

Thanks a lot.

> A new tool to sync current HDFS view to specified snapshot
> ----------------------------------------------------------
>                 Key: HDFS-10314
>                 URL: https://issues.apache.org/jira/browse/HDFS-10314
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-10314.001.patch
> HDFS-9820 proposed adding a -rdiff switch to distcp, as the reverse operation of the -diff switch.
> Upon discussion with [~jingzhao], we will introduce a new tool that wraps around distcp to achieve the same purpose.
> I'm thinking about calling the new tool "rsync", similar to the unix/linux command "rsync". The "r" here means remote.
> The syntax that simulates the -rdiff behavior proposed in HDFS-9820 is
> {code}
> rsync <fromSnapshotName>  <toSnapshotName>  <source> <target>
> {code}
> This command ensures <fromSnapshotName> is newer than <toSnapshotName>.
> I think that, in the future, we can add another command to provide the functionality of distcp's -diff switch:
> {code}
> sync <fromSnapshotName>  <toSnapshotName>  <source> <target>
> {code}
> that ensures <fromSnapshotName> is older than <toSnapshotName>.
> Thanks [~jingzhao].

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
