From: "Jing Zhao (JIRA)"
To: hdfs-issues@hadoop.apache.org
Date: Fri, 12 Sep 2014 05:57:34 +0000 (UTC)
Subject: [jira] [Commented] (HDFS-6584) Support Archival Storage

[ https://issues.apache.org/jira/browse/HDFS-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131142#comment-14131142 ]

Jing Zhao commented on HDFS-6584:
---------------------------------

Thanks a lot for the great comments, [~andrew.wang]! Let me try to answer some of the questions here; I believe [~szetszwo] will provide more details later.

bq. When does the Mover actually migrate data? When a block is finalized? When the file is closed? Some amount of time after? When the admin decides to run the Mover?

Currently the data is only migrated when the admin runs the Mover.

bq.
What is the load impact of scanning the namespace for files that need to be migrated? A naive "ls -R /"-type operation could be bad.

Yes, scanning the namespace is definitely a big burden here. HDFS-6875 adds support for letting users specify a list of paths for migration. In the future we may want to support running multiple Movers concurrently over disjoint directories, or even utilizing MR.

bq. Why are policies specified in XML files rather than in the fsimage / edit log? It seems very important to keep the policies consistent, and this is thus one more file that needs to be synchronized and backed up. Stashing it in the editlog would do this for you.

Agree. Nicholas and I actually discussed this before, and I have an unfinished preliminary patch, but I still need to think through some details. We plan to finish this work after the merge.

bq. Can storage policies be set at a directory level? Testing to confirm this either way?

Yes, this has been done in HDFS-6847.

bq. How does this interact with snapshots? With replication factor, I believe we use the maximum replication factor across all snapshots. Here, would it be the union of all storage types across all snapshots? Not sure how the Mover accounts for this, or if a full-union is the right policy.

This has been addressed in HDFS-6969. Please see the discussion there.

bq. Do we have per-storage-type quotas? Are there APIs exposed to show, for instance, storage type usage by a snapshot, by a directory, etc?

This is a very good suggestion, especially considering that we also have the SSD storage type, and in the future we may also have a MEMORY storage type.

bq. How does this interact with open files?

We should ignore incomplete blocks, which can be inferred from LocatedBlocks. I will file a new jira for this. Thanks!
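To make the Mover discussion above concrete, here is a small sketch of its core matching step: comparing a block's current replica storage types against the types its policy wants, and deriving which replicas need to move. This is an illustrative simplification in plain Python, not the actual Mover code; the HOT/WARM/COLD placements follow the design doc's scheme, and all function names are hypothetical.

```python
from collections import Counter

def desired_types(policy, replication):
    """Desired storage type per replica under a policy (illustrative)."""
    if policy == "HOT":      # all replicas on DISK
        return ["DISK"] * replication
    if policy == "WARM":     # one replica on DISK, the rest on ARCHIVE
        return ["DISK"] + ["ARCHIVE"] * (replication - 1)
    if policy == "COLD":     # all replicas on ARCHIVE
        return ["ARCHIVE"] * replication
    raise ValueError("unknown policy: " + policy)

def migrations_needed(policy, current_types):
    """Return (surplus, deficit) as multiset differences: replicas
    sitting on unwanted storage types, and type slots still unfilled."""
    want = Counter(desired_types(policy, len(current_types)))
    have = Counter(current_types)
    surplus = have - want   # replicas to move off their current type
    deficit = want - have   # storage types still missing a replica
    return surplus, deficit
```

For example, a 3-replica block entirely on DISK under the COLD policy yields a surplus of three DISK replicas and a deficit of three ARCHIVE slots, i.e. all three replicas must be migrated.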
In another scenario, if a block gets appended during the migration, the new replica will be marked as corrupt when it is reported to the NN, because its generation stamp no longer matches the block's current one.

> Support Archival Storage
> ------------------------
>
> Key: HDFS-6584
> URL: https://issues.apache.org/jira/browse/HDFS-6584
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: balancer, namenode
> Reporter: Tsz Wo Nicholas Sze
> Assignee: Tsz Wo Nicholas Sze
> Attachments: HDFS-6584.000.patch, HDFSArchivalStorageDesign20140623.pdf, HDFSArchivalStorageDesign20140715.pdf, archival-storage-testplan.pdf, h6584_20140907.patch, h6584_20140908.patch, h6584_20140908b.patch, h6584_20140911.patch, h6584_20140911b.patch
>
> In most Hadoop clusters, as more and more data is stored for longer periods, the demand for storage is outstripping the demand for compute. Hadoop needs a cost-effective, easy-to-manage solution to meet this storage demand. The current solutions are:
> - Delete old unused data. This comes at the operational cost of identifying unnecessary data and deleting it manually.
> - Add more nodes to the cluster. This adds unnecessary compute capacity along with the storage capacity.
> Hadoop needs a solution that decouples growing storage capacity from compute capacity. Denser, less expensive storage nodes with low compute power are becoming available and can be used as cold storage in clusters. Based on policy, data can be moved from hot storage to cold storage, and adding more nodes to the cold tier grows storage independently of the compute capacity in the cluster.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
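The append-during-migration scenario discussed in the comment can be sketched as a toy model: the Mover copies a replica with the generation stamp it has at copy time, an append then bumps the block's generation stamp on the NN, and the stale copy is flagged as corrupt when reported. This is a hedged illustration of the described behavior, not actual NameNode code; all names here are hypothetical.

```python
class Block:
    """Toy stand-in for a block's NameNode-side state."""
    def __init__(self, gen_stamp):
        self.gen_stamp = gen_stamp

def migrate_replica(block):
    # The Mover copies the replica as-is, carrying the generation
    # stamp as it was at copy time.
    return {"gen_stamp": block.gen_stamp}

def append_to(block):
    # An append bumps the block's generation stamp on the NN.
    block.gen_stamp += 1

def on_block_report(block, replica):
    # The NN treats a reported replica whose generation stamp does not
    # match the block's current one as corrupt.
    return "corrupt" if replica["gen_stamp"] != block.gen_stamp else "ok"
```

For example, if the Mover copies a replica while the block is at generation stamp 1001 and an append then moves the block to 1002, reporting the migrated copy yields "corrupt", while a copy made after the append reports "ok".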