hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6382) HDFS File/Directory TTL
Date Mon, 09 Jun 2014 19:17:03 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025585#comment-14025585 ]

Colin Patrick McCabe commented on HDFS-6382:
--------------------------------------------

For the MR strategy, it seems like this could be parallelized fairly easily.  For example,
if you have 5 MR tasks, you can calculate the hash of each path; then task 1 can handle all
the paths whose hash is 0 mod 5, task 2 those that are 1 mod 5, and so forth.  MR also
doesn't introduce extra dependencies, since HDFS and MR are packaged together.
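
A rough sketch of that hash-mod partitioning in Java (the task count and sample paths are
made up for illustration; tasks are 0-indexed here):

{code:java}
import java.util.Arrays;
import java.util.List;

public class TtlPartitionExample {
  private static final int NUM_TASKS = 5; // illustrative task count

  // Stable partition: the same path always maps to the same task.
  static int taskFor(String path) {
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    return Math.floorMod(path.hashCode(), NUM_TASKS);
  }

  public static void main(String[] args) {
    List<String> paths = Arrays.asList("/logs/a", "/logs/b", "/backup/x");
    for (String p : paths) {
      System.out.println(p + " -> task " + taskFor(p));
    }
  }
}
{code}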

I don't understand what you mean by "the mapreduce strategy will have additional overheads."
What overheads are you foreseeing?

It is true that you need to avoid overloading the NameNode.  But this is a concern with any
approach, not just the MR one.  It would be good to see a section on this.  I think the simplest
way to do it is to rate-limit RPCs to the NameNode to a configurable rate.
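
For example, something like the following would cap delete RPCs.  This is just a sketch
using Guava's RateLimiter (Guava already ships with Hadoop); the 100 requests/sec figure is
illustrative and would come from configuration:

{code:java}
import java.io.IOException;

import com.google.common.util.concurrent.RateLimiter;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RateLimitedDelete {
  // 100 permits/sec is an arbitrary example; make this configurable.
  private final RateLimiter limiter = RateLimiter.create(100.0);
  private final FileSystem fs;

  RateLimitedDelete(FileSystem fs) {
    this.fs = fs;
  }

  boolean deleteWithLimit(Path p) throws IOException {
    limiter.acquire();          // block until we are allowed another RPC
    return fs.delete(p, false); // one delete RPC to the NameNode
  }
}
{code}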

bq. \[for the standalone daemon\] The major advantage of this approach is that we don’t
need any extra work to finish the TTL work, all will be done in the daemon automatically.


I don't understand what you mean by this.  What will be done automatically?

How are you going to implement HA for the standalone daemon?  I suppose if all the state is
kept in HDFS, you can simply restart it when it fails.  However, it seems like you need to
checkpoint how far along in the FS you are, so that if you die and later get restarted, you
don't have to redo the whole FS scan.  This implies reading directories in alphabetical order,
or similar.  You also need to somehow record when the last scan was, perhaps in a file in
HDFS.
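
A minimal sketch of that checkpointing, assuming the daemon records the last fully-scanned
path in an HDFS file (the checkpoint location and format here are made up):

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ScanCheckpoint {
  // Hypothetical checkpoint location; not from the design doc.
  private static final Path CKPT = new Path("/system/ttl/last-scanned");

  // Record the last path whose subtree has been fully scanned.
  static void save(FileSystem fs, String lastScannedPath) throws IOException {
    // A real daemon would write a temp file and rename it for atomicity.
    try (FSDataOutputStream out = fs.create(CKPT, true)) {
      out.write(lastScannedPath.getBytes(StandardCharsets.UTF_8));
    }
  }

  // On restart, resume from the checkpoint instead of rescanning from "/".
  static String load(FileSystem fs) throws IOException {
    if (!fs.exists(CKPT)) {
      return "/";
    }
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(fs.open(CKPT), StandardCharsets.UTF_8))) {
      String line = r.readLine();
      return line != null ? line : "/";
    }
  }
}
{code}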

I don't see a lot of discussion of logging and monitoring in general.  How is the user going
to become aware that a file was deleted because of a TTL?  Or if there is an error during
the delete, how will the user know?  Logging is one choice here.  Creating a file in HDFS
is another.
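
For example, a dedicated audit logger (the logger name and message format here are
hypothetical) would give operators something concrete to grep and monitor:

{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class TtlAuditLog {
  // A dedicated logger name lets operators route TTL events to their own file.
  private static final Log AUDIT = LogFactory.getLog("TtlAudit");

  static void logDeleted(String path, long ttlMillis) {
    AUDIT.info("TTL expired: deleted " + path + " (ttl=" + ttlMillis + " ms)");
  }

  static void logDeleteFailed(String path, Exception cause) {
    AUDIT.error("TTL delete failed for " + path, cause);
  }
}
{code}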

The setTtl command seems reasonable.  Does this need to be an administrator command?

> HDFS File/Directory TTL
> -----------------------
>
>                 Key: HDFS-6382
>                 URL: https://issues.apache.org/jira/browse/HDFS-6382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, namenode
>    Affects Versions: 2.4.0
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>         Attachments: HDFS-TTL-Design.pdf
>
>
> In production environments, we often have a scenario like this: we want to back up files
> on HDFS for some time and then delete them automatically. For example, we keep only 1 day's
> logs on local disk due to limited disk space, but we need to keep about 1 month's logs in
> order to debug program bugs, so we keep all the logs on HDFS and delete logs that are older
> than 1 month. This is a typical scenario for HDFS TTL. So here we propose that HDFS support
> TTL.
> Following are some details of this proposal:
> 1. HDFS can support TTL on a specified file or directory
> 2. If a TTL is set on a file, the file will be deleted automatically after the TTL expires
> 3. If a TTL is set on a directory, the child files and directories will be deleted
> automatically after the TTL expires
> 4. A child file/directory's TTL configuration should override its parent directory's
> (illustrated below)
> 5. A global configuration is needed to control whether deleted files/directories go to
> the trash
> 6. A global configuration is needed to control whether a directory with a TTL should be
> deleted when it is emptied by the TTL mechanism
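
As a side note, rule 4 above (a child's TTL overrides its parent's) could be resolved by
walking up the path and taking the nearest explicit setting.  A toy sketch, with made-up
paths and values:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class EffectiveTtl {
  // Explicit TTLs in milliseconds, keyed by path; all values are made up.
  static final Map<String, Long> TTLS = new HashMap<String, Long>();

  // Walk up from the path toward "/"; the nearest explicit TTL wins,
  // so a child's setting overrides any ancestor's (rule 4).
  static long effectiveTtl(String path) {
    for (String p = path; p != null; p = parent(p)) {
      Long ttl = TTLS.get(p);
      if (ttl != null) {
        return ttl;
      }
    }
    return -1; // no TTL anywhere on the path
  }

  static String parent(String path) {
    if (path.equals("/")) {
      return null;
    }
    int i = path.lastIndexOf('/');
    return i == 0 ? "/" : path.substring(0, i);
  }

  public static void main(String[] args) {
    TTLS.put("/logs", 30L * 24 * 3600 * 1000);      // 30 days on the directory
    TTLS.put("/logs/keep", 90L * 24 * 3600 * 1000); // child overrides: 90 days
    System.out.println(effectiveTtl("/logs/app.log"));    // inherits 30 days
    System.out.println(effectiveTtl("/logs/keep/x.log")); // overridden: 90 days
  }
}
{code}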


