Return-Path:
X-Original-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org
Delivered-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 69C9C18860 for ; Wed, 17 Feb 2016 15:33:29 +0000 (UTC)
Received: (qmail 75773 invoked by uid 500); 17 Feb 2016 15:33:18 -0000
Delivered-To: apmail-hadoop-hdfs-issues-archive@hadoop.apache.org
Received: (qmail 75608 invoked by uid 500); 17 Feb 2016 15:33:18 -0000
Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: hdfs-issues@hadoop.apache.org
Delivered-To: mailing list hdfs-issues@hadoop.apache.org
Received: (qmail 75439 invoked by uid 99); 17 Feb 2016 15:33:18 -0000
Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 17 Feb 2016 15:33:18 +0000
Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 18BBB2C1F5B for ; Wed, 17 Feb 2016 15:33:18 +0000 (UTC)
Date: Wed, 17 Feb 2016 15:33:18 +0000 (UTC)
From: "Yongjun Zhang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Created] (HDFS-9820) Improve distcp to support efficient restore
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

Yongjun Zhang created HDFS-9820:
-----------------------------------

             Summary: Improve distcp to support efficient restore
                 Key: HDFS-9820
                 URL: https://issues.apache.org/jira/browse/HDFS-9820
             Project: Hadoop HDFS
          Issue Type: New Feature
          Components: distcp
            Reporter: Yongjun Zhang
            Assignee: Yongjun Zhang


HDFS-4167 intends to restore HDFS to the most recent snapshot, but that approach involves some complexity and challenges.

HDFS-7535 improved distcp performance by avoiding re-copying files whose names changed since the last backup.
On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data from the source cluster to the target cluster, by copying only the files changed since the last backup. It works by using a snapshot diff to find all changed files and then copying only those files. See
https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/

This jira proposes a variation of HDFS-8828: find the files changed in the target cluster since the last snapshot sx, and copy them from the source cluster's same snapshot sx, to restore the target cluster to sx.

If a file/dir is
- renamed, rename it back
- created in the target cluster, delete it
- modified, put it on the copy list

Then run distcp with the copy list, copying from the source cluster's corresponding snapshot.

This could be a new command line switch -rdiff in distcp.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
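
The per-entry rules proposed above (rename back, delete, add to the copy list) could be sketched roughly as follows. This is only an illustration under assumptions, not distcp code: `DiffEntry`, `RestorePlan`, `plan_restore`, and the kind labels "renamed", "created", and "modified" are hypothetical names, and the input is assumed to be an already-parsed diff of the target cluster against snapshot sx. Entry kinds the proposal does not mention are rejected rather than guessed at.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class DiffEntry:
    """One change on the target cluster since snapshot sx (hypothetical shape)."""
    kind: str                       # "renamed", "created" or "modified" (assumed labels)
    path: str                       # path of the file/dir as of snapshot sx
    new_path: Optional[str] = None  # current path; only set for renames


@dataclass
class RestorePlan:
    """Actions needed to bring the target back to snapshot sx."""
    renames: List[Tuple[str, str]] = field(default_factory=list)  # (current, original)
    deletes: List[str] = field(default_factory=list)
    copy_list: List[str] = field(default_factory=list)


def plan_restore(diff: Iterable[DiffEntry]) -> RestorePlan:
    """Map each diff entry to the action listed in the proposal."""
    plan = RestorePlan()
    for entry in diff:
        if entry.kind == "renamed":
            # Renamed since sx: rename it back to its original path.
            plan.renames.append((entry.new_path, entry.path))
        elif entry.kind == "created":
            # Created on the target after sx: delete it.
            plan.deletes.append(entry.path)
        elif entry.kind == "modified":
            # Modified since sx: re-copy it from the source cluster's snapshot sx.
            plan.copy_list.append(entry.path)
        else:
            # The proposal names only the three kinds above.
            raise ValueError(f"unhandled diff kind: {entry.kind!r}")
    return plan


if __name__ == "__main__":
    example = [
        DiffEntry("renamed", "/data/a.txt", "/data/a_renamed.txt"),
        DiffEntry("created", "/data/tmp.out"),
        DiffEntry("modified", "/data/b.txt"),
    ]
    print(plan_restore(example))
```

Only the paths in `copy_list` would then be handed to distcp, so the amount of data copied is proportional to what changed on the target rather than to the size of the whole tree.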