Return-Path: X-Original-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org Delivered-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id EB56F181BE for ; Sun, 2 Aug 2015 07:38:04 +0000 (UTC) Received: (qmail 37613 invoked by uid 500); 2 Aug 2015 07:38:04 -0000 Delivered-To: apmail-hadoop-hdfs-issues-archive@hadoop.apache.org Received: (qmail 37553 invoked by uid 500); 2 Aug 2015 07:38:04 -0000 Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hdfs-issues@hadoop.apache.org Delivered-To: mailing list hdfs-issues@hadoop.apache.org Received: (qmail 37540 invoked by uid 99); 2 Aug 2015 07:38:04 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 02 Aug 2015 07:38:04 +0000 Date: Sun, 2 Aug 2015 07:38:04 +0000 (UTC) From: "Yongjun Zhang (JIRA)" To: hdfs-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build copy list in distcp MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650627#comment-14650627 ] Yongjun Zhang commented on HDFS-8828: ------------------------------------- Hi [~yufeigu], Thanks for reporting the issue and the patch! It would be a very good performance improvement for distcp if we can skip files copied previously. I browsed the patch and have some comments: # Would you please add java doc to the new methods you introduced including the test area (to describe what the test does)? # There is coding guideline of 80 chars per line. I saw some in your change that exceeded 80. # With the {{type}} field introduced, constructor {{DiffInfo(Path source, Path target)}} would leave type undefined. sounds this constructor should be totally replaced with {{DiffInfo(Path source, Path target, SnapshotDiffReport.DiffType type)}}. # javadoc is needed here: {{static DiffInfo[] getDiffsForListBuilding(SnapshotDiffReport report)}} check RENAME but doesn't check DELETE operation; DistCpSync#sync checks RENAME and DELETE operation. I can see that these two methods work together: deleted files are handled by the sync method only; but renamed file may be modfied too, thus need to be handled by both methods. # Before we see it fully functional, it might be a good idea to introduce a new distcp switch to enable this improvement. When disabled, it behave like before. However, the new switch might make user interface and internal implemenatation more complex. If everything works nicely, we might not need the new switch. We can address this comment later. # Add space between ")" and "throws" in {{private Path getFullPath(Path path)throws IOException}} # Generally, add a space between "//" and the text following it # {{private Path getFullPath(Path path)throws IOException}} is not clear to me: ## why it need a loop if it just need to use the last entry in {{inputs}}? ## {{path = root; return path}} can be simplified as {{return root}} ## is {{root}} the right variable name? The method is {{getFullPath}}, seems not an accurate name. I guess you meant {{getFullPathOfRoot}}? # The following comment is not very clear to me: {code} // if two paths are the same, just modify the target path of // "MODIFY" actions, ignore "CREATE" action. {code} Suggest to try to make it more clear. # When the patch gets to a shape, we need to update distcp document to describe the new improvemnt. Some tricky scenario in the snapshot diff report need to be explained in the document. Thanks. > Utilize Snapshot diff report to build copy list in distcp > --------------------------------------------------------- > > Key: HDFS-8828 > URL: https://issues.apache.org/jira/browse/HDFS-8828 > Project: Hadoop HDFS > Issue Type: Improvement > Components: distcp, snapshots > Reporter: Yufei Gu > Assignee: Yufei Gu > Attachments: HDFS-8828.001.patch, HDFS-8828.002.patch > > > Some users reported huge time cost to build file copy list in distcp. (30 hours for 1.6M files). We can leverage snapshot diff report to build file copy list including files/dirs which are changes only between two snapshots (or a snapshot and a normal dir). It speed up the process in two folds: 1. less copy list building time. 2. less file copy MR jobs. > HDFS snapshot diff report provide information about file/directory creation, deletion, rename and modification between two snapshots or a snapshot and a normal directory. HDFS-7535 synchronize deletion and rename, then fallback to the default distcp. So it still relies on default distcp to building complete list of files under the source dir. This patch only puts creation and modification files into the copy list based on snapshot diff report. We can minimize the number of files to copy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)