hadoop-hdfs-issues mailing list archives

From "Zesheng Wu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6382) HDFS File/Directory TTL
Date Tue, 10 Jun 2014 02:21:01 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026047#comment-14026047 ]

Zesheng Wu commented on HDFS-6382:
----------------------------------

Thanks [~cmccabe] for your feedback.
bq. For the MR strategy, it seems like this could be parallelized fairly easily. For example,
if you have 5 MR tasks, you can calculate the hash of each path, and then task 1 can do all
the paths that are 0 mod 5, task 2 can do all the paths that are 1 mod 5, and so forth. MR
also doesn't introduce extra dependencies since HDFS and MR are packaged together.
You mean that we first scan the whole namespace and then split it into 5 pieces according to the hash of each path? If so, why don't we just complete the work during the first scan? If I've misunderstood your meaning, please point it out.
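Just to check my understanding, the split you describe would look something like the sketch below (the 5-task count and all class/method names are illustrative only):
{code:java}
// Minimal sketch of the hash-mod-N split described above.
import java.util.Arrays;
import java.util.List;

public class PathPartitionSketch {
  private static final int NUM_TASKS = 5; // matches the 5-task example above

  static int taskFor(String path) {
    // Math.floorMod keeps the result non-negative even when hashCode() < 0.
    return Math.floorMod(path.hashCode(), NUM_TASKS);
  }

  public static void main(String[] args) {
    List<String> paths = Arrays.asList("/logs/a", "/logs/b", "/backup/db");
    for (String p : paths) {
      System.out.println(p + " -> task " + taskFor(p));
    }
  }
}
{code}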

bq. I don't understand what you mean by "the mapreduce strategy will have additional overheads."
What overheads are you foreseeing?
Possible overheads: starting a MapReduce job requires splitting the input, starting an AppMaster, and collecting results from machines scattered across the cluster. (Perhaps 'overheads' is not the right word here.)

bq. I don't understand what you mean by this. What will be done automatically?
Here "automatically" means we do not have to rely on external tools, the daemon itself can
manage the work well.

bq. How are you going to implement HA for the standalone daemon?
Good point. As you suggested, one approach is to save the state in HDFS and simply restart the daemon when it fails. But managing that state is complex work, and I am considering how to simplify it. One possible simpler approach is to treat the daemon as stateless and simply restart it when it fails: we don't need to checkpoint at all, and just scan from the beginning on each restart. Because we can require that the work the daemon does is idempotent, starting over from the beginning is harmless (see the sketch below). Possible drawbacks of the latter approach are that it may waste some time and delay the work, but those are acceptable.
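To make the idempotence point concrete, the per-file check would be something like this sketch (class and method names are made up; the real TTL would come from the file's own metadata rather than a parameter):
{code:java}
// Sketch only: the per-file TTL check is idempotent, so a restarted daemon
// can rescan from the beginning with no checkpoint and cause no harm.
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TtlScanSketch {
  // ttlMillis is a parameter here only to keep the sketch self-contained;
  // in the real design it would be read from the file's TTL setting.
  static void checkAndDelete(FileSystem fs, Path path, long ttlMillis)
      throws IOException {
    FileStatus status;
    try {
      status = fs.getFileStatus(path);
    } catch (FileNotFoundException e) {
      return; // already deleted, e.g. by a previous pass before a restart
    }
    long expiresAt = status.getModificationTime() + ttlMillis;
    if (System.currentTimeMillis() >= expiresAt) {
      // Re-running this after a restart reaches the same (correct) decision.
      fs.delete(path, false);
    }
  }
}
{code}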

bq. I don't see a lot of discussion of logging and monitoring in general. How is the user
going to become aware that a file was deleted because of a TTL? Or if there is an error during
the delete, how will the user know? 
For simplicity, in the initial version we will use logs to record which files/directories were deleted by TTL, as well as any errors during the deletion process.
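Roughly like the following sketch (the message wording is illustrative, not a final format):
{code:java}
// Sketch of the planned log-based reporting of TTL deletions and errors.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class TtlAuditLogSketch {
  private static final Log LOG = LogFactory.getLog(TtlAuditLogSketch.class);

  static void logDeleted(String path, long ttlMillis) {
    LOG.info("TTL expired: deleted " + path + " (ttl=" + ttlMillis + " ms)");
  }

  static void logDeleteFailure(String path, Exception cause) {
    LOG.error("TTL delete failed for " + path, cause);
  }
}
{code}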

bq. Does this need to be an administrator command?
It doesn't need to be an administrator command: users can only setTtl on a file/directory for which they have write permission, and can only getTtl on a file/directory for which they have read permission.
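In other words, the rule is just a permission mapping like the sketch below (checkAccess is a hypothetical helper standing in for HDFS's normal permission checking; FsAction is the existing Hadoop class):
{code:java}
// Illustrative mapping of the TTL operations to required permissions.
// checkAccess is hypothetical, not a real HDFS method.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;

public class TtlPermissionSketch {
  void setTtl(Path path, long ttlMillis) {
    checkAccess(path, FsAction.WRITE); // write permission required to set a TTL
    // ... persist the TTL for this path ...
  }

  long getTtl(Path path) {
    checkAccess(path, FsAction.READ); // read permission suffices to read a TTL
    // ... look up the TTL for this path ...
    return 0L; // placeholder
  }

  private void checkAccess(Path path, FsAction action) {
    // Hypothetical: would delegate to the normal HDFS permission checker.
  }
}
{code}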

> HDFS File/Directory TTL
> -----------------------
>
>                 Key: HDFS-6382
>                 URL: https://issues.apache.org/jira/browse/HDFS-6382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, namenode
>    Affects Versions: 2.4.0
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>         Attachments: HDFS-TTL-Design.pdf
>
>
> In production environments, we often have a scenario like this: we want to back up files on HDFS for some time and then delete these files automatically. For example, we keep only 1 day's logs on local disk due to limited disk space, but we need to keep about 1 month's logs in order to debug program bugs, so we keep all the logs on HDFS and delete logs that are older than 1 month. This is a typical scenario for HDFS TTL, so here we propose that HDFS support TTL.
> Following are some details of this proposal:
> 1. HDFS can support a TTL on a specified file or directory
> 2. If a TTL is set on a file, the file will be deleted automatically after the TTL expires
> 3. If a TTL is set on a directory, the child files and directories will be deleted automatically after the TTL expires
> 4. A child file/directory's TTL configuration should override its parent directory's (see the sketch after this list)
> 5. A global configuration option is needed to control whether deleted files/directories go to the trash or not
> 6. A global configuration option is needed to control whether a directory with a TTL should itself be deleted when it is emptied by the TTL mechanism
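An illustrative sketch of item 4's override rule: the nearest explicit TTL on the path wins, so a child's setting shadows its ancestors'. The HashMap here merely stands in for whatever metadata store the design ends up using.
{code:java}
// Sketch only: resolve the effective TTL for a path per item 4 above.
import java.util.HashMap;
import java.util.Map;

public class EffectiveTtlSketch {
  // path -> TTL in milliseconds; only paths with an explicit TTL appear here.
  private final Map<String, Long> ttls = new HashMap<>();

  // Walk from the path up toward the root; the nearest explicit TTL wins.
  Long effectiveTtl(String path) {
    for (String p = path; p != null; p = parentOf(p)) {
      Long ttl = ttls.get(p);
      if (ttl != null) {
        return ttl;
      }
    }
    return null; // no TTL anywhere on the path: never expires
  }

  private static String parentOf(String path) {
    int slash = path.lastIndexOf('/');
    if (slash <= 0) {
      return "/".equals(path) ? null : "/";
    }
    return path.substring(0, slash);
  }

  public static void main(String[] args) {
    EffectiveTtlSketch s = new EffectiveTtlSketch();
    s.ttls.put("/logs", 30L * 24 * 3600 * 1000); // 30 days on the directory
    s.ttls.put("/logs/keep", Long.MAX_VALUE);    // child overrides parent
    System.out.println(s.effectiveTtl("/logs/2014/06/01")); // inherits /logs
    System.out.println(s.effectiveTtl("/logs/keep/x"));     // overridden
  }
}
{code}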



