Date: Wed, 7 Dec 2016 19:43:59 +0000 (UTC)
From: "Manoj Govindassamy (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-11218) Add option to skip open files during HDFS Snapshots

     [ https://issues.apache.org/jira/browse/HDFS-11218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manoj Govindassamy updated HDFS-11218:
--------------------------------------
    Description:
*Problem:*
When files are being written while HDFS snapshots are taken in parallel, the snapshots do capture all these files, but
they do not record the point-in-time length of the files still being written.
When a previously open file is closed, or undergoes any other metadata modification, HDFS reconciles the file length and records the modification in the most recently taken snapshot. All earlier snapshots continue to hold the same open file with no modification recorded, so they end up using the final modification record from the next available snapshot.
*Proposal:*
The HDFS snapshot design goal was O(M) space usage for snapshots, where M is the number of file modifications. It would therefore be very expensive to record modifications for all the open files in all the snapshots. For applications that do not want to capture incomplete, partially written binary files in their snapshots, an extra option to skip open files would be preferable. That way, they don't have to worry about restoring inconsistent files from the snapshots.
{noformat}
hdfs dfs -createSnapshot -skipOpenFiles
{noformat}

  was:
Problem:
When files are being written while HDFS snapshots are taken in parallel, the snapshots do capture all these files, but they do not record the point-in-time length of the files still being written.
When a previously open file is closed, or undergoes any other metadata modification, HDFS reconciles the file length and records the modification in the most recently taken snapshot. All earlier snapshots continue to hold the same open file with no modification recorded, so they end up using the final modification record from the next available snapshot.
Proposal:
The HDFS snapshot design goal was O(M) space usage for snapshots, where M is the number of file modifications. It would therefore be very expensive to record modifications for all the open files in all the snapshots.
For applications that do not want to capture incomplete, partially written binary files in their snapshots, an extra option to skip open files would be preferable. That way, they don't have to worry about restoring inconsistent files from the snapshots.
{noformat}
hdfs dfs -createSnapshot -skipOpenFiles
{noformat}


> Add option to skip open files during HDFS Snapshots
> ---------------------------------------------------
>
>                 Key: HDFS-11218
>                 URL: https://issues.apache.org/jira/browse/HDFS-11218
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: snapshots
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Manoj Govindassamy
>            Assignee: Manoj Govindassamy
>
> *Problem:*
> When files are being written while HDFS snapshots are taken in parallel, the snapshots do capture all these files, but they do not record the point-in-time length of the files still being written.
> When a previously open file is closed, or undergoes any other metadata modification, HDFS reconciles the file length and records the modification in the most recently taken snapshot. All earlier snapshots continue to hold the same open file with no modification recorded, so they end up using the final modification record from the next available snapshot.
> *Proposal:*
> The HDFS snapshot design goal was O(M) space usage for snapshots, where M is the number of file modifications. It would therefore be very expensive to record modifications for all the open files in all the snapshots. For applications that do not want to capture incomplete, partially written binary files in their snapshots, an extra option to skip open files would be preferable. That way, they don't have to worry about restoring inconsistent files from the snapshots.
> {noformat}
> hdfs dfs -createSnapshot -skipOpenFiles
> {noformat}
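
For illustration only, the behaviour described under *Problem:* and the option proposed under *Proposal:* can be sketched with a small toy model in Python. This is not HDFS code. The names ToyFile, ToyNamespace, snapshot_length and the skip_open_files parameter are hypothetical, and the model assumes that the length recorded at close time is the reconciled final length. It only shows two things: earlier snapshots fall through to the record kept in the next available snapshot, and a skip-open-files snapshot leaves files that are still being written out entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyFile:
    """Hypothetical stand-in for a file entry; not an HDFS class."""
    name: str
    length: int = 0
    is_open: bool = True
    # snapshot id -> length recorded for that snapshot.
    # Sparse on purpose: one record per modification, i.e. O(M) space.
    diffs: Dict[int, int] = field(default_factory=dict)


class ToyNamespace:
    """Toy model of the snapshot behaviour described in this issue."""

    def __init__(self) -> None:
        self.files: Dict[str, ToyFile] = {}
        # snapshot id -> names of the files captured by that snapshot
        self.snapshots: Dict[int, List[str]] = {}

    def create(self, name: str) -> None:
        self.files[name] = ToyFile(name)

    def append(self, name: str, nbytes: int) -> None:
        f = self.files[name]
        if not f.is_open:
            raise ValueError(f"{name} is closed")
        f.length += nbytes

    def create_snapshot(self, skip_open_files: bool = False) -> int:
        """Take a snapshot; optionally leave out files still being written."""
        sid = len(self.snapshots) + 1
        self.snapshots[sid] = [
            n for n, f in self.files.items()
            if not (skip_open_files and f.is_open)
        ]
        return sid

    def close(self, name: str) -> None:
        """Close the file and record its length in the latest snapshot only.

        Assumption: the recorded value is the reconciled final length.
        """
        f = self.files[name]
        f.is_open = False
        holders = [s for s, names in self.snapshots.items() if name in names]
        if holders:
            f.diffs[max(holders)] = f.length

    def snapshot_length(self, sid: int, name: str) -> Optional[int]:
        """Length of `name` as seen through snapshot `sid`, or None if absent."""
        if name not in self.snapshots[sid]:
            return None
        f = self.files[name]
        # Use the record in this snapshot or, failing that, the next
        # available later snapshot; with no record at all, the live length.
        later = [s for s in sorted(f.diffs) if s >= sid]
        return f.diffs[later[0]] if later else f.length


if __name__ == "__main__":
    ns = ToyNamespace()
    ns.create("log")
    ns.append("log", 100)
    s1 = ns.create_snapshot()      # file holds 100 bytes at this point
    ns.append("log", 100)
    s2 = ns.create_snapshot()      # file holds 200 bytes at this point
    ns.append("log", 100)
    ns.close("log")
    # Both snapshots resolve to the single record kept in s2.
    print(ns.snapshot_length(s1, "log"), ns.snapshot_length(s2, "log"))
```

In this model neither snapshot reports the 100 or 200 bytes that existed when it was taken, which is the inconsistency the issue describes. Calling create_snapshot(skip_open_files=True) while a file is open makes snapshot_length return None for that file, so nothing partial can be restored from that snapshot.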