hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6584) Support Archival Storage
Date Fri, 12 Sep 2014 05:57:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131142#comment-14131142 ]

Jing Zhao commented on HDFS-6584:

Thanks a lot for the great comments, [~andrew.wang]! Let me try to answer some of the questions
here, and I believe [~szetszwo] will provide more details later.

bq. When does the Mover actually migrate data? When a block is finalized? When the file is
closed? Some amount of time after? When the admin decides to run the Mover?

Currently, data is migrated only when an admin runs the Mover.

bq. What is the load impact of scanning the namespace for files that need to be migrated?
A naive ls -R / type operation could be bad.

Yeah, scanning the namespace is definitely a heavy burden here. HDFS-6875 adds support for
specifying a list of paths to migrate. In the future we may also want to support running
multiple Movers concurrently on disjoint directories, or even using MapReduce (MR).
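
To make the "disjoint directories" idea concrete, here is a minimal Python sketch of how a
path list could be validated and split across several Mover instances. It is only an
illustration: the function names and the round-robin assignment are assumptions made for
this example and are not taken from the Mover or from HDFS-6875.

```python
from pathlib import PurePosixPath


def is_ancestor_or_same(a: PurePosixPath, b: PurePosixPath) -> bool:
    """Return True if a equals b or a is an ancestor directory of b."""
    return a == b or a in b.parents


def overlapping_pairs(paths):
    """Return every pair of paths where one contains (or equals) the other."""
    normalized = [PurePosixPath(p) for p in paths]
    conflicts = []
    for i, a in enumerate(normalized):
        for b in normalized[i + 1:]:
            if is_ancestor_or_same(a, b) or is_ancestor_or_same(b, a):
                conflicts.append((str(a), str(b)))
    return conflicts


def assign_to_movers(paths, mover_count):
    """Round-robin disjoint paths across a fixed number of Mover instances.

    Hypothetical helper: refuses overlapping paths so that no two Movers
    would ever scan the same subtree.
    """
    conflicts = overlapping_pairs(paths)
    if conflicts:
        raise ValueError(f"paths overlap: {conflicts}")
    buckets = [[] for _ in range(mover_count)]
    for index, path in enumerate(sorted(paths)):
        buckets[index % mover_count].append(path)
    return buckets
```

Rejecting overlapping paths up front is one simple way to guarantee that concurrent Movers
never work on the same files; a real implementation might instead merge nested paths.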

bq. Why are policies specified in XML files rather than in the fsimage / edit log? It seems
very important to keep the policies consistent, and this is thus one more file that needs
to be synchronized and backed up. Stashing it in the editlog would do this for you.

Agreed. Nicholas and I discussed this before, and I have an unfinished preliminary patch,
but I still need to think through some details. We plan to finish this work after the
merge.

bq. Can storage policies be set at a directory level? Testing to confirm this either way?

Yes, this has been done in HDFS-6847.
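
As a rough illustration of what a directory-level policy implies for files underneath it,
the Python sketch below resolves a path's effective policy by walking up to the nearest
ancestor that has a policy set. The inheritance rule, the policy names (HOT, WARM, COLD)
and the function name are assumptions made for this example; see HDFS-6847 for the actual
behavior.

```python
from pathlib import PurePosixPath

# Assumed default policy name, used only for this sketch.
DEFAULT_POLICY = "HOT"


def effective_policy(path, explicit_policies, default=DEFAULT_POLICY):
    """Return the policy set on the path itself or on its nearest ancestor.

    explicit_policies maps a path string to a policy name for every file or
    directory that has a policy set directly on it.
    """
    current = PurePosixPath(path)
    # Check the path first, then each parent from the closest up to the root.
    for candidate in [current, *current.parents]:
        policy = explicit_policies.get(str(candidate))
        if policy is not None:
            return policy
    return default
```

For example, with `{"/archive": "COLD", "/archive/recent": "WARM"}`, a file under
`/archive/2013` resolves to COLD, while a file under `/archive/recent` resolves to WARM.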

bq. How does this interact with snapshots? With replication factor, I believe we use the maximum
replication factor across all snapshots. Here, would it be the union of all storage types
across all snapshots? Not sure how the Mover accounts for this, or if a full-union is the
right policy.

This has been addressed in HDFS-6969. Please see the discussion there.
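
For readers following the question, the Python sketch below only illustrates the two
aggregation rules it mentions: the maximum replication factor across snapshots, and a
union of storage types across snapshots. It does not describe what HDFS-6969 decided. The
snapshot names and the storage type names DISK and ARCHIVE are placeholders chosen for
this example.

```python
def max_replication(replication_by_snapshot):
    """Largest replication factor recorded for a file in any snapshot."""
    return max(replication_by_snapshot.values())


def storage_type_union(types_by_snapshot):
    """All storage types that any snapshot's policy asks for."""
    union = set()
    for types in types_by_snapshot.values():
        union.update(types)
    return union


# Hypothetical per-snapshot view of a single file.
replication = {"current": 2, "s1": 3, "s2": 1}
storage_types = {
    "current": {"ARCHIVE"},
    "s1": {"DISK"},
    "s2": {"DISK", "ARCHIVE"},
}
```

Under a full-union rule, the file in this example would need replicas on both DISK and
ARCHIVE, which is exactly the cost the question raises.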

bq. Do we have per-storage-type quotas? Are there APIs exposed to show, for instance, storage
type usage by a snapshot, by a directory, etc?

This is a very good suggestion, especially considering that we also have the SSD storage
type and may add a MEMORY storage type in the future.

bq. How does this interact with open files?

We should ignore the incomplete block, which can be identified from LocatedBlocks. I
will file a new JIRA for this. Thanks!
In another scenario, if a block is appended to during the migration, the new replica
will be marked as corrupt when it is reported to the NN because of the generation stamp
mismatch.
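
The two cases above can be sketched in Python as follows. This is a simplified model, not
Mover or NameNode code: the `BlockInfo` class, its fields and both function names are
hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockInfo:
    """Simplified stand-in for the per-block data a client sees for a file."""
    block_id: int
    generation_stamp: int
    complete: bool  # False while the block is still being written


def blocks_to_migrate(blocks):
    """Case 1: skip any block that is still incomplete (file is open)."""
    return [block for block in blocks if block.complete]


def replica_is_inconsistent(replica_generation_stamp, current_generation_stamp):
    """Case 2: a copied replica whose generation stamp differs from the
    block's current one would be treated as corrupt when reported."""
    return replica_generation_stamp != current_generation_stamp


# Hypothetical open file: two finished blocks and one under construction.
blocks = [
    BlockInfo(block_id=1, generation_stamp=1001, complete=True),
    BlockInfo(block_id=2, generation_stamp=1002, complete=True),
    BlockInfo(block_id=3, generation_stamp=1003, complete=False),
]
```

Here `blocks_to_migrate(blocks)` keeps only blocks 1 and 2. If block 2 were appended to
while being copied and its generation stamp changed, the copy made with stamp 1002 would
no longer be consistent with the block.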

> Support Archival Storage
> ------------------------
>                 Key: HDFS-6584
>                 URL: https://issues.apache.org/jira/browse/HDFS-6584
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: balancer, namenode
>            Reporter: Tsz Wo Nicholas Sze
>            Assignee: Tsz Wo Nicholas Sze
>         Attachments: HDFS-6584.000.patch, HDFSArchivalStorageDesign20140623.pdf, HDFSArchivalStorageDesign20140715.pdf,
archival-storage-testplan.pdf, h6584_20140907.patch, h6584_20140908.patch, h6584_20140908b.patch,
h6584_20140911.patch, h6584_20140911b.patch
> In most Hadoop clusters, as more and more data is stored for longer periods, the demand
> for storage is outstripping the demand for compute. Hadoop needs a cost-effective and
> easy-to-manage solution to meet this demand for storage. The current options are:
> - Delete old, unused data. This comes at the operational cost of identifying unnecessary
> data and deleting it manually.
> - Add more nodes to the clusters. This adds unnecessary compute capacity along with the
> storage capacity.
> Hadoop needs a solution that decouples growing storage capacity from compute capacity. Nodes
> with higher-density, less expensive storage and low compute power are becoming available
> and can be used as cold storage in the clusters. Based on policy, data can be moved from
> hot storage to cold storage. Adding more nodes to the cold storage can grow storage
> independently of the compute capacity in the cluster.

This message was sent by Atlassian JIRA
