hadoop-common-dev mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1869) access times of HDFS files
Date Mon, 10 Sep 2007 19:33:29 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526232 ]

Allen Wittenauer commented on HADOOP-1869:

Currently on our bigger grids, we have a significant number of files that we aren't sure
anyone is actually using (e.g., in /tmp).  While I recognize that atime is a huge performance
killer, whenever one deals with users who have free rein over their space, it is an incredibly
important tool to help maintain the system.

This is especially important given the lack of ACLs.  On our larger grids, there are many
files scattered all over, and we have no real insight into their purpose,
much less their usage pattern.  All we know is that we didn't put them there. :)

Having an access log would at least tell us whether something is being used.  Once users get
added to the system, having the user information combined with whether a file was touched
will be extremely handy. 

Operationally, I see this being used by dumping the data on a regular interval into an RDBMS
or perhaps even inside HDFS itself.  It is then fairly trivial to create tools and form policies
around data retention.
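
As a rough illustration of the kind of retention tooling described above, here is a minimal
sketch in Java. It assumes the access data has already been dumped somewhere and loaded into
a map of path to last-access time. The class name RetentionReport, the method staleFiles, and
the idle threshold are all hypothetical; none of this is an existing Hadoop API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reports files not accessed within a given idle window. */
class RetentionReport {

    /**
     * Returns the paths whose last recorded access is older than (now - maxIdle),
     * sorted alphabetically. The input map stands in for data dumped from an
     * access log into an RDBMS or a file; how it gets loaded is out of scope here.
     */
    static List<String> staleFiles(Map<String, Instant> lastAccess,
                                   Instant now,
                                   Duration maxIdle) {
        Instant cutoff = now.minus(maxIdle);
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : lastAccess.entrySet()) {
            if (entry.getValue().isBefore(cutoff)) {
                stale.add(entry.getKey());
            }
        }
        Collections.sort(stale);
        return stale;
    }
}
```

A retention policy could then be as simple as running such a report on a schedule and
reviewing (or archiving) whatever it lists.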

> access times of HDFS files
> --------------------------
>                 Key: HADOOP-1869
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1869
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: dhruba borthakur
> HDFS should support some type of statistic that allows an administrator to determine
> when a file was last accessed.
> Since HDFS does not have quotas yet, it is likely that users keep on accumulating files
> in their home directories without much regard to the amount of space they are occupying. This
> causes memory-related problems with the namenode.
> Access times are costly to maintain. AFS does not maintain access times. I think DCE-DFS
> does maintain access times with a coarse granularity.
> One proposal for HDFS would be to implement something like an "access bit".
> 1. This access-bit is set when a file is accessed. If the access bit is already set,
> then this call does not result in a transaction.
> 2. A FileSystem.clearAccessBits() call indicates that the access bits of all files need to
> be cleared.
> An administrator can effectively use the above mechanism (maybe a daily cron job) to
> determine files that are recently used.
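
The access-bit proposal quoted above can be sketched roughly as follows. This is a toy
in-memory model in Java, not namenode code: the class AccessBitTracker and the methods
recordAccess, isAccessed, and transactionCount are hypothetical names chosen for
illustration, and the transaction counter only simulates edit-log writes. Only the name
clearAccessBits comes from the issue description.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory model of the proposed "access bit"; not HDFS code. */
class AccessBitTracker {
    // Paths whose access bit is currently set.
    private final Set<String> accessed = new HashSet<>();
    // Simulated count of transactions caused by setting a bit.
    private long transactions = 0;

    /**
     * Records an access to the given path. Returns true only when the bit was
     * previously clear, i.e. when this call would result in a transaction.
     */
    synchronized boolean recordAccess(String path) {
        if (accessed.add(path)) {
            transactions++;   // bit was clear: set it and log once
            return true;
        }
        return false;         // bit already set: no transaction
    }

    /** Returns whether the path has been accessed since the last clear. */
    synchronized boolean isAccessed(String path) {
        return accessed.contains(path);
    }

    /** Models the proposed clearAccessBits(): clears the bit on every file. */
    synchronized void clearAccessBits() {
        accessed.clear();
    }

    /** Number of simulated transactions caused by recordAccess so far. */
    synchronized long transactionCount() {
        return transactions;
    }
}
```

Under this model, a daily job would read out which bits are set and then call
clearAccessBits(), so repeated reads of a hot file cost at most one transaction per day.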

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
