Date: Thu, 6 Oct 2016 07:08:21 +0000 (UTC)
From: "Yongjun Zhang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-9820) Improve distcp to support efficient restore to an earlier snapshot

[ https://issues.apache.org/jira/browse/HDFS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15551122#comment-15551122 ]

Yongjun Zhang commented on HDFS-9820:
-------------------------------------

Hi [~andrew.wang],

Many thanks again for the review.
I uploaded rev 006, hopefully with all of your comments addressed. I added back the fallback feature of -diff (if sync fails, fall back to the default distcp) so that -diff's behavior does not change, and added the same fallback to -rdiff so that it is consistent with -diff. Once this patch is committed, we can safely backport it to 2.x if we like. I will create a jira for 3.0 to drop the fallback feature.

Would you please help take a look? Thanks much.

> Improve distcp to support efficient restore to an earlier snapshot
> ------------------------------------------------------------------
>
>                 Key: HDFS-9820
>                 URL: https://issues.apache.org/jira/browse/HDFS-9820
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: distcp
>   Affects Versions: 2.6.4
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-9820.001.patch, HDFS-9820.002.patch, HDFS-9820.003.patch, HDFS-9820.004.patch, HDFS-9820.005.patch, HDFS-9820.006.patch
>
>
> A common use scenario (scenario 1):
> # create snapshot sx in clusterX,
> # do some experiments in clusterX, which create some files,
> # throw away the changed files and go back to sx.
> Another scenario (scenario 2): there is a production cluster and a backup cluster, and we periodically sync the data from the production cluster to the backup cluster with distcp.
> The cluster in scenario 1 could be the backup cluster in scenario 2.
> For scenario 1:
> HDFS-4167 intends to restore HDFS to the most recent snapshot, and it involves some complexity and challenges. Until that jira is implemented, we count on distcp to copy from the snapshot to the current state. However, this operation can be very slow because we have to go through all files even if only a few files changed.
> For scenario 2:
> HDFS-7535 improved distcp performance by avoiding copying files whose names changed since the last backup.
> On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data from the source to the target cluster by copying only the files changed since the last backup. It works by using the snapshot diff to find all changed files and copying only those.
> See https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
> This jira proposes a variation of HDFS-8828: find the files changed in the target cluster since the last snapshot sx, and copy them from snapshot sx of either the source or the target cluster, to restore the target cluster's current state to sx.
> Specifically, if a file/dir is
> - renamed, rename it back
> - created in the target cluster, delete it
> - modified, put it on the copy list
> Then run distcp with the copy list, copying from the source cluster's corresponding snapshot.
> This could be a new command line switch -rdiff in distcp.
> As a native restore feature, HDFS-4167 would still be ideal to have. However, HDFS-9820 would hopefully be easier to implement, before HDFS-4167 is in place.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
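
The restore steps listed in the issue description (rename back, delete, put on the copy list) can be illustrated with a small planning sketch. This is not the patch's implementation: `DiffEntry`, `RestorePlan`, and `plan_restore` are hypothetical names, the one-letter kinds are assumed to follow the HDFS snapshot diff report markers (`R`, `+`, `M`, `-`), and treating a deleted file as something to copy back is an assumption, since the description does not mention deletions. Ordering concerns between renames and deletes are ignored here.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class DiffEntry:
    """One change on the target since snapshot sx (hypothetical structure)."""
    kind: str                       # "R" renamed, "+" created, "M" modified, "-" deleted
    path: str                       # path as it existed in sx (or the created path for "+")
    new_path: Optional[str] = None  # only set for renames: the current path


@dataclass
class RestorePlan:
    """Actions needed to bring the target back to snapshot sx."""
    rename_back: List[Tuple[str, str]] = field(default_factory=list)  # (current, original)
    delete: List[str] = field(default_factory=list)
    copy: List[str] = field(default_factory=list)


def plan_restore(entries: Iterable[DiffEntry]) -> RestorePlan:
    """Classify each diff entry into the action that reverses it."""
    plan = RestorePlan()
    for entry in entries:
        if entry.kind == "R":
            if entry.new_path is None:
                raise ValueError(f"rename entry without new_path: {entry.path}")
            # Renamed since sx: move it from its current name back to the old one.
            plan.rename_back.append((entry.new_path, entry.path))
        elif entry.kind == "+":
            # Created since sx: it did not exist in the snapshot, so delete it.
            plan.delete.append(entry.path)
        elif entry.kind == "M":
            # Modified since sx: re-copy it from the snapshot.
            plan.copy.append(entry.path)
        elif entry.kind == "-":
            # Assumption (not stated in the description): a deleted file is
            # restored by copying it back from the snapshot as well.
            plan.copy.append(entry.path)
        else:
            raise ValueError(f"unknown diff kind: {entry.kind!r}")
    return plan
```

Only the paths in `plan.copy` would then be handed to distcp, which is why the cost scales with the number of changed files and not with the size of the whole directory tree.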