hadoop-common-dev mailing list archives

From "Preston Pfarner (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-1572) should have utime method in HDFS & FileSystem to set modification times.
Date Tue, 17 Feb 2009 21:50:59 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674367#action_12674367 ]

pfarner edited comment on HADOOP-1572 at 2/17/09 1:49 PM:
------------------------------------------------------------------

Copying between filesystems (mentioned above) and restore from backup are familiar use cases.

Here's another use case: repeated update of the output of a data pipeline.  Suppose we have
input files A1, A2, ..., An, which need to be aggregated into file B whenever another file Ai
is created.  It's then useful to have an easy way to know when B needs to be updated.  If one
compares the modification time of B to the modification times of A1, ..., An, then a race
condition can cause some updates of B to be delayed forever.  If we could set the modification
time of B, then we could avoid this race condition cleanly.  (details below, for the curious)

Race condition sequence:
  * A1 and A2 are created
  * transformation operation to create B starts, chooses A1 and A2 as inputs
  * A3 is created
  * output of the transformation operation is stored as B
At this point, B contains data from A1 and A2, but not from A3, and yet B's modification time
is later than A3's.  If we could set the timestamp, we could choose A1,A2 as inputs, record
the maximum of their timestamps (tmax) in the JobConf, and then create B with a modification
time of tmax.  tmax would be less than A3's modification time, so it would be clear that B
needs to be updated and the race condition would be prevented.
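
To make the idea concrete, here is a minimal sketch on a local filesystem, using Python's
os.utime as a stand-in for the proposed operation.  It is not HDFS code, and the helper names
(newest_mtime_ns, build_output, needs_update) are made up for illustration.  It also assumes
that an input whose modification time equals tmax counts as already included.

```python
import os


def newest_mtime_ns(paths):
    """Return the latest modification time (in nanoseconds) among paths."""
    return max(os.stat(p).st_mtime_ns for p in paths)


def build_output(inputs, output):
    """Aggregate inputs into output, then stamp output with tmax."""
    # Record tmax for the chosen inputs *before* doing any work, so that
    # files created while the job runs are guaranteed to look newer than B.
    tmax_ns = newest_mtime_ns(inputs)
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as src:
                out.write(src.read())
    # Replace the "time B was written" with "newest input B contains".
    os.utime(output, ns=(tmax_ns, tmax_ns))
    return tmax_ns


def needs_update(inputs, output):
    """True if output is missing or some input is newer than output."""
    if not os.path.exists(output):
        return True
    # Assumption: strict comparison, so an mtime equal to tmax is "up to date".
    return newest_mtime_ns(inputs) > os.stat(output).st_mtime_ns
```

With B stamped this way, an A3 created during the transformation has a later modification time
than B, so needs_update reports that B is stale.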

In order to avoid this problem in current systems, I create a secondary file containing the
timestamp for B as text, but this doubles the number of name node entries needed for
B, and is slower than using the modification time.
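
For comparison, the secondary-file workaround looks roughly like the sketch below, again on a
local filesystem rather than HDFS.  The ".tmax" suffix and the function names are hypothetical,
chosen only for this example; the point is that every staleness check costs an extra file
and an extra read.

```python
import os


def sidecar_path(output):
    """Path of the secondary file that stores B's logical timestamp."""
    return output + ".tmax"  # hypothetical naming convention


def write_sidecar(output, tmax_ns):
    """Store tmax (nanoseconds) as text next to the output file."""
    with open(sidecar_path(output), "w") as f:
        f.write(str(tmax_ns))


def read_sidecar(output):
    """Return the recorded tmax, or None if no secondary file exists."""
    try:
        with open(sidecar_path(output)) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return None


def needs_update_from_sidecar(inputs, output):
    """True if no timestamp is recorded or some input is newer than it."""
    recorded = read_sidecar(output)
    if recorded is None:
        return True
    return max(os.stat(p).st_mtime_ns for p in inputs) > recorded
```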


This change would be a significant improvement in my use of hadoop, so I'm naturally motivated
to help.  I've created a prototype of a patch (using the "option 2" style above), and I'll
refine it, confirm that I'm complying with hadoop's style guidelines, and post it here.  Any
information you have on special complications would be appreciated.  (I know that some FileSystems
won't support this operation, but I'm more worried about subtler problems).

      was (Author: pfarner):
    Copying between filesystems (mentioned above) and restore from backup are familiar use
cases.

Here's another use case: incremental update of a data pipeline.  If we have input files
A1, A2, ..., An, which need to be aggregated into file B whenever another file Ai is created,
then it's useful to have an easy way to know when B needs to be updated.  If one compares
the modification time of B to the modification times of A1,..An, then a race condition can
cause some updates of B to be delayed forever.  If we could modify the modification time of
B, then we could avoid this race condition cleanly.  (details below, for the curious)

Race condition sequence:
  * A1 and A2 are created
  * transformation operation to create B starts, chooses A1 and A2 as inputs
  * A3 is created
  * output of the transformation operation is stored as B
At this point, B contains data from A1 and A2, but not from A3, and yet B's modification time
is later than A3's.  If we could set the timestamp, we could choose A1,A2 as inputs, record
the maximum of their timestamps (tmax) in the JobConf, and then create B with a modification
time of tmax.  tmax would be less than A3's modification time, so the race condition would
be prevented.

In order to avoid this problem in current systems, I create a secondary file containing the
timestamp for B as text, but this doubles the number of name node entries needed for B, and
is slower than using the modification time.


This change would be a significant improvement in my use of hadoop, so I'm naturally motivated
to help.  I've created a prototype of a patch (using the "option 2" style above), and I'll
refine it, confirm that I'm complying with hadoop's style guidelines, and post it here.  Any
information you have on special complications would be appreciated.  (I know that some FileSystems
won't support this operation, but I'm more worried about subtler problems).
  
> should have utime method in HDFS & FileSystem to set modification times.
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-1572
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1572
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Owen O'Malley
>
> It would be nice to modify the modification times of files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

