hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6584) Support Archival Storage
Date Mon, 15 Sep 2014 18:38:37 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134276#comment-14134276 ]

Jing Zhao commented on HDFS-6584:

Thanks [~andrew.wang]!

bq. Since the Mover is based on the Balancer, is there any concern about it being too slow
to move data from fast storage to archival? If all data migrates off to archival, the mover
needs to keep up with the aggregate write rate of the cluster. The balancer, putting it mildly,
is not the fastest tool in this regard.

Here are some of my thoughts. Please let me know if I missed something, [~szetszwo].
1) Currently the migration tool still depends on an admin to mark files/dirs as COLD/WARM.
It should be rare for users to keep actively writing new data into a directory after marking
it as COLD, so for now this may not be a critical concern.
2) Tools/services may later be developed to actively/automatically scan the namespace and
mark COLD files based on different rules such as access/modification time. In some cases,
if the rule is very aggressive and the migration is very slow, we may hit the issue you mentioned.
The current Mover uses the Dispatcher, or more generally, the {{DataTransferProtocol#replaceBlock}}
protocol. I guess that with more aggressive settings (e.g., a higher max number of blocks scheduled
on each DataNode for migration), the migration should not be very slow, and it should
be easy for us to replace the Dispatcher with a faster migration framework.
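
To make point 2 concrete, here is a rough sketch of a modification-time rule and a back-of-the-envelope check of whether migration keeps up with the rate at which data is marked COLD. Everything in it is an assumption for illustration: the class and method names are hypothetical and not part of HDFS, and the throughput model (DataNodes x concurrent moves per node x bandwidth per move) is only an upper bound that ignores contention on source and target nodes.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for illustration only; not an HDFS class. */
public class ColdMigrationSketch {

    /** Example rule: a file counts as COLD if it was last modified more than maxAge ago. */
    static boolean isCold(Instant lastModified, Instant now, Duration maxAge) {
        return Duration.between(lastModified, now).compareTo(maxAge) > 0;
    }

    /**
     * Optimistic upper bound on cluster-wide migration throughput in bytes/s,
     * assuming every DataNode runs its full quota of concurrent moves.
     */
    static long migrationBytesPerSec(int dataNodes, int concurrentMovesPerNode,
                                     long bytesPerSecPerMove) {
        return (long) dataNodes * concurrentMovesPerNode * bytesPerSecPerMove;
    }

    /** Migration keeps up if it can move data at least as fast as data is marked COLD. */
    static boolean keepsUp(long migrationBytesPerSec, long coldMarkingBytesPerSec) {
        return migrationBytesPerSec >= coldMarkingBytesPerSec;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2014-09-15T00:00:00Z");
        Duration maxAge = Duration.ofDays(90); // assumed rule threshold

        System.out.println("old file cold?    "
            + isCold(Instant.parse("2014-01-01T00:00:00Z"), now, maxAge));
        System.out.println("recent file cold? "
            + isCold(Instant.parse("2014-09-01T00:00:00Z"), now, maxAge));

        // Assumed cluster: 100 DataNodes, 5 concurrent moves each, 10 MiB/s per move.
        long migration = migrationBytesPerSec(100, 5, 10L * 1024 * 1024);
        long coldRate = 1L << 30; // assumed: 1 GiB/s of data becomes COLD
        System.out.println("migration keeps up? " + keepsUp(migration, coldRate));
    }
}
```

In this model, raising the number of concurrent moves per DataNode scales the bound linearly, which is the intuition behind the "more aggressive settings" remark. A real cluster would need measurements rather than this estimate.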

bq. We exposed cachedHosts in BlockLocation, so application schedulers can choose to place
their tasks for cache locality. We need a similar thing for storage type, so schedulers can
prefer "hotter" replicas.
This is a very good suggestion; we can add this information later. Thanks!
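
As a rough sketch of what a scheduler could do once storage type is exposed per replica, the example below orders replicas from "hottest" to "coldest". The Tier enum, the Replica class, and hottestFirst are all hypothetical names made up for illustration; they are not the BlockLocation API, and the tier ordering is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; none of these types exist in HDFS. */
public class HotReplicaSketch {

    /** Assumed storage tiers, declared from hottest to coldest. */
    enum Tier { SSD, DISK, ARCHIVE }

    /** Hypothetical replica description: the host and the tier it is stored on. */
    static final class Replica {
        final String host;
        final Tier tier;

        Replica(String host, Tier tier) {
            this.host = host;
            this.tier = tier;
        }
    }

    /** Returns a copy of the replicas sorted hottest first; ties keep their input order. */
    static List<Replica> hottestFirst(List<Replica> replicas) {
        List<Replica> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingInt((Replica r) -> r.tier.ordinal()));
        return sorted;
    }

    public static void main(String[] args) {
        List<Replica> replicas = Arrays.asList(
            new Replica("dn1", Tier.ARCHIVE),
            new Replica("dn2", Tier.SSD),
            new Replica("dn3", Tier.DISK));

        // A scheduler would try to place the task near the first entry.
        for (Replica r : hottestFirst(replicas)) {
            System.out.println(r.host + " " + r.tier);
        }
    }
}
```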

BTW, HDFS-7062 has been committed to fix the open file issue. A doc patch has been uploaded
in HDFS-6864. Thanks again for the great comments, [~andrew.wang]!

> Support Archival Storage
> ------------------------
>                 Key: HDFS-6584
>                 URL: https://issues.apache.org/jira/browse/HDFS-6584
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: balancer, namenode
>            Reporter: Tsz Wo Nicholas Sze
>            Assignee: Tsz Wo Nicholas Sze
>         Attachments: HDFS-6584.000.patch, HDFSArchivalStorageDesign20140623.pdf, HDFSArchivalStorageDesign20140715.pdf,
>                      archival-storage-testplan.pdf, h6584_20140907.patch, h6584_20140908.patch, h6584_20140908b.patch,
>                      h6584_20140911.patch, h6584_20140911b.patch, h6584_20140915.patch
>
> In most Hadoop clusters, as more and more data is stored for longer periods, the
> demand for storage is outstripping the demand for compute. Hadoop needs a cost-effective
> and easy-to-manage solution to meet this demand for storage. The current options are:
> - Delete the old unused data. This comes at the operational cost of identifying unnecessary
> data and deleting it manually.
> - Add more nodes to the clusters. This adds unnecessary compute capacity along with the
> storage capacity.
> Hadoop needs a solution to decouple growing storage capacity from compute capacity. Nodes
> with higher-density, less expensive storage and low compute power are becoming available
> and can be used as cold storage in the clusters. Based on policy, data can be moved from
> hot storage to cold storage. Adding more nodes to the cold storage can grow the storage
> independently of the compute capacity in the cluster.

This message was sent by Atlassian JIRA
