hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build copy list in distcp
Date Sun, 02 Aug 2015 07:38:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650627#comment-14650627

Yongjun Zhang commented on HDFS-8828:

Hi [~yufeigu],

Thanks for reporting the issue and the patch! It would be a very good performance improvement
for distcp if we can skip files copied previously.

I browsed the patch and have some comments:

# Would you please add java doc to the new methods you introduced including the test area
(to describe what the test does)?
# There is coding guideline of 80 chars per line. I saw some in your change that exceeded
# With the {{type}} field introduced, constructor {{DiffInfo(Path source, Path target)}} would
leave type undefined. sounds this constructor should be totally replaced with {{DiffInfo(Path
source, Path target, SnapshotDiffReport.DiffType type)}}. 
# javadoc is needed here: {{static DiffInfo[] getDiffsForListBuilding(SnapshotDiffReport report)}}
check RENAME but doesn't check DELETE operation; DistCpSync#sync checks RENAME and DELETE
operation. I can see that these two methods work together: deleted files are handled by the
sync method only; but renamed file may be modfied too, thus need to be handled by both methods.
# Before we see it fully functional, it might be a good idea to introduce a new distcp switch
to enable this improvement. When disabled, it behave like before. However, the new switch
might make user interface and internal implemenatation more complex. If everything works nicely,
we might not need the new switch. We can address this comment later.
# Add space between ")" and "throws" in {{private Path getFullPath(Path path)throws IOException}}
# Generally, add a space between "//" and the text following it
# {{private Path getFullPath(Path path)throws IOException}} is not clear to me: 
## why it need a loop if it just need to use the last entry in {{inputs}}?
## {{path = root; return path}} can be simplified as {{return root}} 
## is {{root}} the right variable name? The method is {{getFullPath}}, seems not an accurate
name. I guess you meant {{getFullPathOfRoot}}?
# The following comment is not very clear to me:
  // if two paths are the same, just modify the target path of
  // "MODIFY" actions, ignore "CREATE" action.
Suggest to try to make it more clear. 
# When the patch gets to a shape, we need to update distcp document to describe the new improvemnt.
Some tricky scenario in the snapshot diff report need to be explained in the document.


> Utilize Snapshot diff report to build copy list in distcp
> ---------------------------------------------------------
>                 Key: HDFS-8828
>                 URL: https://issues.apache.org/jira/browse/HDFS-8828
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp, snapshots
>            Reporter: Yufei Gu
>            Assignee: Yufei Gu
>         Attachments: HDFS-8828.001.patch, HDFS-8828.002.patch
> Some users reported huge time cost to build file copy list in distcp. (30 hours for 1.6M
files). We can leverage snapshot diff report to build file copy list including files/dirs
which are changes only between two snapshots (or a snapshot and a normal dir). It speed up
the process in two folds: 1. less copy list building time. 2. less file copy MR jobs.
> HDFS snapshot diff report provide information about file/directory creation, deletion,
rename and modification between two snapshots or a snapshot and a normal directory. HDFS-7535
synchronize deletion and rename, then fallback to the default distcp. So it still relies on
default distcp to building complete list of files under the source dir. This patch only puts
creation and modification files into the copy list based on snapshot diff report. We can minimize
the number of files to copy. 

This message was sent by Atlassian JIRA

View raw message