hadoop-common-dev mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1869) access times of HDFS files
Date Mon, 18 Aug 2008 23:19:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623489#action_12623489 ]

Konstantin Shvachko commented on HADOOP-1869:

I think this proposal is in the right direction.
According to HADOOP-3860, the name-node can currently perform 20-25 times more opens per
second than creates, which means that if we let every open / getBlockLocations call be logged
and flushed, we lose big.
Another observation is that map-reduce does a lot of {{ls}} operations, both for directories
and for individual files. I have seen 20,000 per second. This happens when a job starts, and
the count depends on the user's input data and on how many tasks the job will run.
So maybe we should not log file access for ls, permission checking, etc. I think it would
be sufficient to write {{OP_SET_ACCESSTIME}} only in the case of getBlockLocations().
Also, I think we should support access times only for regular files, not for directories.
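A minimal sketch of that gating (class, field, and constant names here are illustrative, not the actual HDFS internals): update the access time only from the getBlockLocations() path, and only when the stored value is stale by more than a precision window, so repeated opens of a hot file do not each produce a journal record.

```java
// Hypothetical sketch -- not real name-node code. The idea: OP_SET_ACCESSTIME
// is journaled only from getBlockLocations(), and only when the in-memory
// access time is older than a coarse precision window.
public class AccessTimeSketch {
    static final long PRECISION_MS = 60 * 60 * 1000L; // e.g. 1-hour granularity

    long accessTime;      // in-memory access time of one file (ms)
    int journaledOps = 0; // stands in for OP_SET_ACCESSTIME records written

    /** Called from getBlockLocations() only -- not from ls or permission checks. */
    void touchOnRead(long now) {
        if (now - accessTime >= PRECISION_MS) {
            accessTime = now;
            journaledOps++; // only the first read per window costs a journal record
        }
    }

    public static void main(String[] args) {
        AccessTimeSketch f = new AccessTimeSketch();
        long t0 = 10_000_000L;
        // 1000 reads within one precision window -> a single journal record
        for (int i = 0; i < 1000; i++) {
            f.touchOnRead(t0 + i);
        }
        System.out.println(f.journaledOps); // 1
    }
}
```

With a one-hour window, even a file opened thousands of times per second contributes at most one journal record per hour.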

Another alternative would be to keep the access time only in the name-node memory. Would that
be sufficient to detect "malicious" behavior by some users? Name-nodes usually run for months,
right? So before, say, upgrading the name-node, or simply every (other) week,
administrators could look at files that have never been touched during that period and act accordingly.
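The in-memory alternative could be sketched like this (a toy illustration, not name-node code; a real implementation would hang the timestamp off the inode and seed it when the image is loaded):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the in-memory-only alternative: access times live
// only in the name-node heap (no journal records), and an administrator
// periodically lists files not touched since some cutoff.
public class InMemoryAccessTimes {
    private final Map<String, Long> accessTime = new HashMap<>();

    /** Record a read; nothing is written to the edits log. */
    void recordRead(String path, long now) {
        accessTime.put(path, now);
    }

    /** Files whose last recorded access is older than the cutoff. */
    List<String> untouchedSince(long cutoff) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> e : accessTime.entrySet()) {
            if (e.getValue() < cutoff) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        InMemoryAccessTimes nn = new InMemoryAccessTimes();
        nn.recordRead("/user/a/old.txt", 100L);
        nn.recordRead("/user/a/hot.txt", 900L);
        System.out.println(nn.untouchedSince(500L)); // [/user/a/old.txt]
    }
}
```

The obvious trade-off is that all access times are lost on a name-node restart, which is exactly why the periodic administrator sweep has to happen before an upgrade.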
My main concern is that even with Dhruba's approach, where we batch access operations and do
not lose time flushing them individually, the journaling traffic will roughly double; that is,
with each flush more bytes need to be written. Meaning increased latency for each flush,
and bigger edits files.

It would be good to have some experimental data measuring throughput and latency for getBlockLocations
with and without ACCESSTIME
transactions. The easy way to test would be to use NNThroughputBenchmark.
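Such a comparison might look like the following (the class name is real; the exact flags vary between Hadoop versions, so treat this invocation as a sketch rather than an exact command line):

```shell
# Measure open (getBlockLocations) throughput against a benchmark name-node.
# Run once with access-time journaling enabled and once with it disabled,
# then compare ops/sec and average latency between the two runs.
hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark \
    -op open -threads 10 -files 100000
```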

> access times of HDFS files
> --------------------------
>                 Key: HADOOP-1869
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1869
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
> HDFS should support some type of statistics that allows an administrator to determine
> when a file was last accessed.
> Since HDFS does not have quotas yet, it is likely that users keep on accumulating files
> in their home directories without much regard to the amount of space they are occupying.
> This causes memory-related problems with the namenode.
> Access times are costly to maintain. AFS does not maintain access times. I think DCE-DFS
> does maintain access times with a coarse granularity.
> One proposal for HDFS would be to implement something like an "access bit".
> 1. This access-bit is set when a file is accessed. If the access bit is already set,
> then this call does not result in a transaction.
> 2. A FileSystem.clearAccessBits() indicates that the access bits of all files need to
> be cleared.
> An administrator can effectively use the above mechanism (maybe a daily cron job) to
> determine files that are recently used.
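The access-bit mechanism quoted above could be sketched as follows (an illustrative toy, not proposed HDFS code; the class and counter names are made up for the example):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the "access bit" proposal: setting the bit on an
// already-accessed file is a no-op (no transaction), and clearAccessBits()
// resets all bits -- e.g. from a daily cron job. Files whose bit is set
// afterwards were accessed since the last clear.
public class AccessBitSketch {
    private final Map<String, Boolean> accessBit = new HashMap<>();
    int transactions = 0; // state changes that would cost a transaction

    void open(String path) {
        if (!Boolean.TRUE.equals(accessBit.get(path))) {
            accessBit.put(path, true);
            transactions++; // only the first access since a clear costs one
        }
    }

    void clearAccessBits() {
        for (String p : accessBit.keySet()) {
            accessBit.put(p, false);
        }
    }

    Set<String> recentlyUsed() {
        Set<String> used = new TreeSet<>();
        for (Map.Entry<String, Boolean> e : accessBit.entrySet()) {
            if (e.getValue()) {
                used.add(e.getKey());
            }
        }
        return used;
    }

    public static void main(String[] args) {
        AccessBitSketch nn = new AccessBitSketch();
        nn.open("/user/a/f1");
        nn.open("/user/a/f1"); // repeat access: no extra transaction
        nn.open("/user/b/f2");
        System.out.println(nn.transactions);   // 2
        nn.clearAccessBits();
        nn.open("/user/b/f2");
        System.out.println(nn.recentlyUsed()); // [/user/b/f2]
    }
}
```

This keeps the per-access cost to at most one transaction per file per clear cycle, which is the property the proposal is after.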

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
