hadoop-hdfs-issues mailing list archives

From "Aaron T. Myers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2802) Support for RW/RO snapshots in HDFS
Date Mon, 22 Oct 2012 19:22:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481682#comment-13481682 ]

Aaron T. Myers commented on HDFS-2802:

bq. Why is starting with simple implementation and then optimizing it later not a choice?

This sort of fundamental design decision is not something that can be easily improved incrementally.
Copying huge portions of the working set, and then making that copying fast and space efficient,
should not be the goal. The goal should be to entirely avoid copying huge portions of the
working set.

bq. O(1) memory usage in general does not seem possible since the original files/directories
could be modified. So the best case is O(N) memory usage in general. However, it is possible
to have O(1) memory usage at snapshot creation.

I agree that it's not possible to have an O(1) solution in terms of the number of files/directories
that are modified, but it is certainly possible to have an O(1) solution in terms of the number
of files/directories that are _not_ modified. That's the issue I'm concerned about, and as
far as I can tell, such a solution is not what this design document proposes.
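
To make the distinction concrete, here is a minimal sketch of a snapshot scheme whose cost depends only on what is modified. It is a toy in-memory model, not HDFS code and not the design under discussion: the class `ToyNamespace` and all of its methods are hypothetical names chosen for illustration. Creating a snapshot copies nothing, and each snapshot's change log grows only when a path is first modified after that snapshot.

```python
_ABSENT = object()  # marks "path did not exist when the snapshot was taken"


class ToyNamespace:
    """Hypothetical in-memory namespace; an illustration only, not HDFS code."""

    def __init__(self):
        self.current = {}  # live state: path -> value
        self.logs = []     # one change log per snapshot: path -> old value

    def create_snapshot(self):
        # O(1): open an empty change log, copy nothing.
        self.logs.append({})
        return len(self.logs) - 1

    def _save_old_value(self, path):
        # Record the old value once, and only in the newest snapshot's log.
        if self.logs and path not in self.logs[-1]:
            self.logs[-1][path] = self.current.get(path, _ABSENT)

    def write(self, path, value):
        self._save_old_value(path)
        self.current[path] = value

    def delete(self, path):
        self._save_old_value(path)
        self.current.pop(path, None)

    def read_snapshot(self, snapshot_id, path):
        # The first log at or after snapshot_id that mentions the path holds
        # its value as of that snapshot. If no log mentions it, the path has
        # not changed since the snapshot, so the live value is the answer.
        for log in self.logs[snapshot_id:]:
            if path in log:
                value = log[path]
                return None if value is _ABSENT else value
        return self.current.get(path)

    def snapshot_overhead(self, snapshot_id):
        # Number of entries held on behalf of this snapshot.
        return len(self.logs[snapshot_id])
```

In this sketch a snapshot of a namespace with any number of unmodified paths has an overhead of zero entries; the trade-off is that reading through an old snapshot may walk the logs of later snapshots.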

bq. For small subtrees, i.e. when N is small, it does not matter if it is O(1) or O(N). Such
a snapshot feature already benefits many applications. So we are going to implement O(N) snapshot
creation in the first phase and then optimize it later.

What are the use cases for taking snapshots of small subtrees? An implementation that is suitable
only for small subtrees is probably impractical for snapshotting an HBase root directory, a Hive
warehouse, or most /user directories that I'm aware of. You'd also presumably want to keep
at least a handful (10s?) of snapshots available, so the size of any small subtree that could be
snapshotted must be multiplied by ~10 to estimate its total snapshot footprint. Note that the design
document also explicitly states that it should be possible to take a snapshot of the root of the file system.
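
The multiplication above can be written out as a small calculation. The sketch below compares the number of metadata records held for snapshots under two schemes: one that copies a record for every file in the subtree at each snapshot, and one that keeps a record only for files modified between snapshots. The function names and the input figures in the usage note are illustrative assumptions, not measurements and not numbers from the design document.

```python
def copy_based_records(files_in_subtree, snapshots_kept):
    """Records held if every snapshot copies an entry for every file."""
    return files_in_subtree * snapshots_kept


def change_based_records(modified_per_snapshot, snapshots_kept):
    """Records held if every snapshot stores only the files changed since
    the previous snapshot (assumed constant per interval for simplicity)."""
    return modified_per_snapshot * snapshots_kept
```

For a hypothetical subtree of 1,000,000 files with 10 snapshots kept and 1,000 files modified between snapshots, `copy_based_records(1_000_000, 10)` gives 10,000,000 records while `change_based_records(1_000, 10)` gives 10,000.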

bq. Then, we could have the snapshot feature out early instead of spending a long time to
come up with a complicated design and implementation. A complicated design also increases the risk
of bugs in the implementation.

I'm all for a simple design, but the design must also meet the stated requirements. The design
document states that it should be possible to create a snapshot of the root of the file system,
but I don't think the proposed design can do such a thing.

Suresh had previously said that "May be the design document is fairly early and might have
misled you. That is not the goal. The goal is to have efficient implementation." If we're
on the same page that the end goal in this branch is an O(1) implementation, in terms of the
number of files that are not modified between snapshots, then we can move on.

> Support for RW/RO snapshots in HDFS
> -----------------------------------
>                 Key: HDFS-2802
>                 URL: https://issues.apache.org/jira/browse/HDFS-2802
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: data-node, name-node
>            Reporter: Hari Mankude
>            Assignee: Hari Mankude
>         Attachments: snap.patch, snapshot-one-pager.pdf, Snapshots20121018.pdf
> Snapshots are point-in-time images of parts of the filesystem or the entire filesystem.
> Snapshots can be a read-only or a read-write point-in-time copy of the filesystem. There are
> several use cases for snapshots in HDFS. I will post a detailed write-up soon with more

