hadoop-common-dev mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4058) Transparent archival and restore of files from HDFS
Date Wed, 03 Sep 2008 05:53:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627909#action_12627909 ]

dhruba borthakur commented on HADOOP-4058:

Hadoop users have been using Hadoop clusters as a queryable archive warehouse. This means
that once data gets into the warehouse, it is very unlikely to be deleted. This creates tremendous
pressure to keep adding storage capacity to the production cluster.

There could be a set of storage-heavy nodes that cannot be added to the production cluster
because they do not have enough memory and CPU. One option would be to run these nodes as a
separate "old-cluster" and use it to archive old files from the production cluster.

A layer of software can scan the file system in the production cluster to find files with
the earliest access times (HADOOP-1869). These files can be moved to the old-cluster and the
original file in the production cluster can be replaced by a symbolic link (via HADOOP-4044).
Reads of the original path still work because they follow the symbolic link. Some other piece
of software periodically scans the old-cluster, finds files that were accessed recently,
and tries to move them back to the production cluster.
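
The two scans described above can be sketched as follows. This is only an illustration under
stated assumptions, not the HADOOP-4058 design or any HDFS API: two local directories stand in
for the production cluster and the old-cluster, POSIX symlinks and `st_atime` stand in for the
HDFS symbolic links and access times, and the function names and the `max_idle_seconds`
threshold are hypothetical.

```python
import os
import shutil
import time


def archive_cold_files(prod_dir, archive_dir, max_idle_seconds, now=None):
    """Move files idle for longer than max_idle_seconds into archive_dir,
    leaving a symbolic link at the original path (hypothetical helper)."""
    now = time.time() if now is None else now
    moved = []
    for name in sorted(os.listdir(prod_dir)):
        src = os.path.join(prod_dir, name)
        # Skip directories and paths that are already links into the archive.
        if os.path.islink(src) or not os.path.isfile(src):
            continue
        if now - os.stat(src).st_atime > max_idle_seconds:
            dst = os.path.abspath(os.path.join(archive_dir, name))
            shutil.move(src, dst)
            os.symlink(dst, src)  # readers of the old path follow the link
            moved.append(name)
    return moved


def restore_hot_files(prod_dir, archive_dir, max_idle_seconds, now=None):
    """Move archived files accessed within max_idle_seconds back to prod_dir,
    replacing the symbolic link that pointed at them (hypothetical helper)."""
    now = time.time() if now is None else now
    restored = []
    for name in sorted(os.listdir(archive_dir)):
        archived = os.path.join(archive_dir, name)
        link = os.path.join(prod_dir, name)
        # Only restore when the production path is still our symbolic link.
        if not os.path.islink(link):
            continue
        if now - os.stat(archived).st_atime <= max_idle_seconds:
            os.remove(link)
            shutil.move(archived, link)
            restored.append(name)
    return restored
```

Both functions would be run periodically by a process outside the file system itself, which
is the layering the comment argues for. A real policy engine would also have to deal with
concurrent readers and writers, which this sketch ignores.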

The advantage of this approach is that it is "layered": it is not built into HDFS but depends
on two artifacts of HDFS, symbolic links and access times. I hate to put more and more intelligence
into core HDFS, because the code then becomes very bloated and difficult to maintain.

> Transparent archival and restore of files from HDFS
> ---------------------------------------------------
>                 Key: HADOOP-4058
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4058
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
> There should be a facility to migrate old files away from a production cluster. Access
> to those files from applications should continue to work transparently, without changing application
> code, but maybe with reduced performance. The policy engine that does this could be layered
> on HDFS rather than being built into HDFS itself.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
