Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Sat, 7 Jan 2012 18:39:39 +0000 (UTC)
From: "Joe Kraska (Commented) (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: 
 <884526496.18855.1325961579349.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HDFS-233) Support for snapshots
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HDFS-233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182056#comment-13182056 ] 

Joe Kraska commented on HDFS-233:
---------------------------------

Reviewing the comments and noting the data warehousing feature requests and the like, I thought I would comment on the snapshot feature from the more pragmatic perspective of simple, responsible data stewardship.

By and large, the most important features of snapshots are being able to:
1. Take them live.
2. Take them economically: a snapshot should not require particularly large amounts of space.
3. Keep a dozen or so (and often fewer).
4. Schedule them (hourly, daily, weekly, with emphasis on the latter two).
5. Selectively restore portions of the tree after user- or program-caused erasure or damage.
6. Quickly restore either a subportion of the tree or an entire volume.

The above set of features is about fundamental data protection, cost, and restore time objectives. These features are directly related to economical data stewardship, and are considered the first line of defense for data protection in many enterprises today. That is, we data stewards prefer these features over tape restores (we also use tape, but we hate it).

*AFTER* the above, space-efficient *writable* snapshots are interesting. This is because some applications need to be tested against current data sets: touching the master data set is a complete no-no, but the application needs to make trial changes. These snapshots are often made, modified for a while, then deleted.

You will want minimal performance impact for these snapshots, because the assumption should be that the scheduled snapshot system is ALWAYS used. The one exception to this is static read-only data where a single manual snapshot is recorded just once. Everything else will have something like 2 daily and 2 weekly snapshots going all the time. Some enterprises will also use hourly snapshots scheduled every 6 hours or so and retain about a day of those...
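
To make the retention arithmetic concrete, here is a minimal sketch of the kind of pruning a scheduler would do. It is only an illustration of the policy described above, not an HDFS API: the function name, the schedule class labels, and the tuple layout are all hypothetical, and the counts (2 daily, 2 weekly, 4 six-hourly) are just the example figures from this comment.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical retention counts based on the figures in this comment:
# 2 daily, 2 weekly, and 4 six-hourly snapshots (about one day's worth).
RETENTION = {"hourly": 4, "daily": 2, "weekly": 2}


def snapshots_to_prune(snapshots, retention=RETENTION):
    """Return the snapshots that fall outside the retention policy.

    `snapshots` is an iterable of (schedule_class, created_at) tuples.
    The newest `retention[schedule_class]` snapshots of each class are kept.
    Classes not listed in `retention` (e.g. a one-off manual snapshot of
    static data) are never pruned.
    """
    by_class = defaultdict(list)
    for schedule_class, created_at in snapshots:
        by_class[schedule_class].append(created_at)

    prune = []
    for schedule_class, times in by_class.items():
        if schedule_class not in retention:
            continue  # unmanaged class: leave it alone
        keep = retention[schedule_class]
        # Sort newest first and mark everything past the first `keep` entries.
        for created_at in sorted(times, reverse=True)[keep:]:
            prune.append((schedule_class, created_at))

    # Oldest first, so a caller can delete in chronological order.
    return sorted(prune, key=lambda s: s[1])
```

With six snapshots taken at 6-hour intervals, three daily snapshots, and one manual snapshot, this would mark the two oldest six-hourly snapshots and the oldest daily snapshot for deletion and leave the manual one untouched.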

As a side note (and no offense to the Hadoop community), I regard all shared storage as defective for data stewardship purposes if it does not have the above features (except writable snapshots, that's candy), and I am not the least bit alone. Any data protection strategy that says "go to tape for that" as its first offer is... onerous.

While the following is merely my opinion, I feel pretty sure that the rise of the enterprise NAS appliance (e.g., NetApp et al.) is at least partly due to snapshot protection being on by default on those devices. Food for thought.

                
> Support for snapshots
> ---------------------
>
>                 Key: HDFS-233
>                 URL: https://issues.apache.org/jira/browse/HDFS-233
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: Snapshots.pdf, Snapshots.pdf
>
>
> Support HDFS snapshots. It should support creating snapshots without shutting down the file system. Snapshot creation should be lightweight and a typical system should be able to support a few thousands concurrent snapshots. There should be a way to surface (i.e. mount) a few of these snapshots simultaneously.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira