hadoop-hdfs-issues mailing list archives

From "Joe Kraska (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-233) Support for snapshots
Date Sat, 07 Jan 2012 18:39:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182056#comment-13182056 ]

Joe Kraska commented on HDFS-233:

Reviewing the comments and noting the data warehousing feature requests and the like, I thought
I would comment on the snapshot feature from the more pragmatic perspective of simple, responsible
data stewardship.

By and large, the most important features of snapshots are being able to:
1. Do them live.
2. Do them economically: do not require particularly large amounts of space for the snapshot.
3. Have a dozen or so (and often fewer).
4. Schedule them (hourly, daily, weekly, with emphasis on the latter two).
5. Selectively restore portions of the tree after user- or program-caused
erasure or damage.
6. Quickly conduct a restore of either a subportion of the tree or an entire

The above set of features is about fundamental data protection, cost, and restore time objectives.
They are directly related to economical data stewardship, and are considered the first line
of defense for data protection in many enterprises today. I.e., we data stewards prefer these
features over tape restores (although we also use tape, we hate it).

*AFTER* the above, space-efficient *writable* snapshots are interesting. This is because there
are test applications that work on current data sets, where touching the master data set is a complete
no-no, but the application needs to make trial changes. These snapshots are often made, modified
for a while, then deleted.

You will want minimal performance impact for these snapshots, because the assumption should
be that the scheduled snapshot system is ALWAYS used. The one exception to this is static
read-only data where a single manual snapshot is recorded just once. Everything else will
have something like 2 daily and 2 weekly snapshots going all the time. Some enterprises will
also use hourly snapshots scheduled every 6 hours or so and retain about a day of those...
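
As a rough illustration of the retention schedule described above (this is only a sketch, not anything HDFS provides), the pruning side of a scheduled snapshot system might look like the following. The `RETENTION` table, the `(kind, timestamp)` representation, and both function names are assumptions made up for this example; the counts simply mirror the "2 daily, 2 weekly, about a day of 6-hourly" figures.

```python
from datetime import datetime, timedelta

# Hypothetical policy table: newest N snapshots to retain per schedule.
# "hourly" here means the every-6-hours schedule, so 4 covers about a day.
RETENTION = {"hourly": 4, "daily": 2, "weekly": 2}


def snapshots_to_keep(snapshots, retention=RETENTION):
    """Return the set of (kind, timestamp) pairs to retain.

    `snapshots` is an iterable of (kind, timestamp) pairs, where `kind`
    is a key of `retention`. The newest N of each kind are kept.
    """
    snapshots = list(snapshots)
    keep = set()
    for kind, limit in retention.items():
        # Newest first, so slicing picks the most recent `limit` entries.
        stamps = sorted((ts for k, ts in snapshots if k == kind), reverse=True)
        keep.update((kind, ts) for ts in stamps[:limit])
    return keep


def snapshots_to_delete(snapshots, retention=RETENTION):
    """Return the snapshots that fall outside the retention policy."""
    snapshots = list(snapshots)
    keep = snapshots_to_keep(snapshots, retention)
    return [s for s in snapshots if s not in keep]


if __name__ == "__main__":
    now = datetime(2012, 1, 7, 18, 0)
    existing = [("hourly", now - timedelta(hours=6 * i)) for i in range(6)]
    existing += [("daily", now - timedelta(days=i)) for i in range(3)]
    for kind, ts in snapshots_to_delete(existing):
        print("would delete", kind, ts.isoformat())
```

A scheduler would run something like this after each new snapshot is taken; the point is only that the policy is small and mechanical, which is why it can be assumed to be always on.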

As a side note (and no offense to the hadoop community), I regard all shared storage as defective
for data stewardship purposes if it does not have the above features (except writable snapshots;
that's candy), and I am not the least bit alone. Any data protection strategy that says "go
to tape for that" as its first offer is... onerous.

While the following matter is merely my opinion, I feel pretty sure that the rise of the enterprise
NAS appliance (e.g., NetApp et al.) is at least partly due to the default nature of snapshot
protection on those devices. Food for thought.

> Support for snapshots
> ---------------------
>                 Key: HDFS-233
>                 URL: https://issues.apache.org/jira/browse/HDFS-233
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: Snapshots.pdf, Snapshots.pdf
> Support HDFS snapshots. It should support creating snapshots without shutting down the
> file system. Snapshot creation should be lightweight and a typical system should be able to
> support a few thousands concurrent snapshots. There should be a way to surface (i.e. mount)
> a few of these snapshots simultaneously.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
