hadoop-common-issues mailing list archives

From "Joel Baranick (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14124) S3AFileSystem silently deletes "fake" directories when writing a file.
Date Mon, 27 Feb 2017 23:18:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886734#comment-15886734 ]

Joel Baranick commented on HADOOP-14124:
----------------------------------------

Hey Steve,

Thanks for the info.  I read the Hadoop Filesystem specification, and it seems like this scenario
violates parts of it.

First, the postcondition of the specification for {{FSDataOutputStream create(Path, ...)}}
states that the "... updated (valid) FileSystem must contain all the parent directories of
the path, as created by mkdirs(parent(p)).".  I would contend that in this scenario the opposite
happens: writing the file removes the parent directory entries rather than preserving them.
A minimal probe of this postcondition is sketched below.
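
As a concrete illustration, here is a minimal sketch of probing that postcondition against an {{s3a://}} bucket (the bucket name and paths are placeholders; credentials and the S3A connector are assumed to be configured):
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreatePostconditionProbe {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), new Configuration());

        Path parent = new Path("/job/task");
        fs.mkdirs(parent);                 // writes fake directory markers

        FSDataOutputStream out = fs.create(new Path(parent, "file"));
        out.close();                       // S3A deletes the parent markers here

        // Per the spec, every parent must still exist after create().
        // S3A's own exists() may still report true, because it infers
        // directories from key prefixes, but the marker objects that
        // other S3 clients rely on are gone.
        System.out.println("parent exists per S3A: " + fs.exists(parent));
    }
}
{code}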

Second, the "Empty (non-root) directory" postcondition of the specification for {{FileSystem.delete(Path
P, boolean recursive)}} states that "Deleting an empty directory that is not root will remove
the path from the FS and return true.".  While that is what occurs, I think it is incorrect
to consider a fake directory empty when it contains another fake directory.  For example,
on Debian, the following fails (a Java analogue of the same contract follows the snippet):
{noformat}
[~]# mkdir job
[~]# cd job
[job]# mkdir task
[job]# cd ..
[~]# rmdir job
rmdir: failed to remove ‘job’: Directory not empty
{noformat}
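
The same contract on the Hadoop API, as a hedged sketch (same placeholder bucket and paths as above; per the specification, a non-recursive delete of a non-empty directory must be rejected):
{code:java}
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteContractProbe {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), new Configuration());

        fs.mkdirs(new Path("/job/task")); // /job now contains /job/task

        try {
            boolean deleted = fs.delete(new Path("/job"), false);
            // A true result here would mean /job was treated as empty
            // despite containing /job/task, the behavior questioned above.
            System.out.println("delete returned " + deleted);
        } catch (IOException e) {
            System.out.println("rejected, mirroring rmdir: " + e.getMessage());
        }
    }
}
{code}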

Additionally, the interaction of AmazonS3Client/CyberDuck with empty directories seems different
from what you described.  See the following scenario (a programmatic sketch of the same setup follows the list):
# Open CyberDuck and connect to an S3 Bucket
# Create a folder called {{job}} in CyberDuck
# Right-click on the {{job}} folder and open +Info+. Result: _Size = 0B_ and _S3 tab works_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}}.  Result: _Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}. Result:
#* _job/_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir;empty^
# Navigate into the {{job}} folder in CyberDuck
# Create a folder called {{task}} in CyberDuck
# Right-click on the {{task}} folder and open +Info+.  Result: _Size = 0B_ and _S3 tab works_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}}.  Result: _Success_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/task/")}}.  Result: _Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}.  Result:
#* _job/_
#* _job/task/_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir;empty^
#* _s3a://bucket/job/task_ ^dir^
# Upload _file_ into _/job/task_ via CyberDuck
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}}.  Result: _Success_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/task/")}}.  Result: _Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}.  Result:
#* _job/_
#* _job/task/_
#* _job/task/file_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir;empty^
#* _s3a://bucket/job/task_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/task/"))}}.  Result: 
#* _s3a://bucket/job/task_ ^dir;empty^
#* _s3a://bucket/job/task/file_ ^file^
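
To reproduce the folder-marker setup without CyberDuck, the markers can be written directly with the AWS SDK.  This is a minimal sketch against the v1 {{AmazonS3Client}}; the bucket name is a placeholder, and the zero-byte, trailing-slash keys stand in for the markers CyberDuck and the console create:
{code:java}
import java.io.ByteArrayInputStream;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class FakeDirSetup {
    public static void main(String[] args) {
        AmazonS3Client s3 = new AmazonS3Client(); // default credential chain

        // CyberDuck and the AWS console model a folder as a zero-byte
        // object whose key ends in "/"; write the same markers directly.
        putMarker(s3, "bucket", "job/");
        putMarker(s3, "bucket", "job/task/");

        // Both metadata lookups succeed while the markers exist,
        // matching the getObjectMetadata results in the steps above.
        s3.getObjectMetadata("bucket", "job/");
        s3.getObjectMetadata("bucket", "job/task/");

        // Listing the prefix shows both markers: job/, job/task/
        ObjectListing listing = s3.listObjects("bucket", "job/");
        for (S3ObjectSummary summary : listing.getObjectSummaries()) {
            System.out.println(summary.getKey());
        }
    }

    private static void putMarker(AmazonS3Client s3, String bucket, String key) {
        ObjectMetadata md = new ObjectMetadata();
        md.setContentLength(0);
        s3.putObject(bucket, key, new ByteArrayInputStream(new byte[0]), md);
    }
}
{code}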

At this point, if you delete {{/job/task/file}} in CyberDuck or the AWS Console, the {{/job}}
and {{/job/task}} folders continue to exist, and all calls continue to return the same results
as before (except that {{/job/task/file}} is excluded from any list results).  If, on the other
hand, you had created {{/job/task/file}} via S3AFileSystem, the write would implicitly remove
the parent folders it considers "empty".  Then, when {{/job/task/file}} is deleted, the parent
"empty" directories are gone as well; a sketch of this before/after difference follows.
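
To make the before/after difference concrete, here is a hedged sketch combining the SDK and S3A calls from above (same placeholder bucket; it assumes the markers from the previous sketch are in place):
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.AmazonS3Exception;

public class MarkerLossProbe {
    public static void main(String[] args) throws Exception {
        AmazonS3Client s3 = new AmazonS3Client();
        FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), new Configuration());

        // Before the write: both folder markers resolve.
        s3.getObjectMetadata("bucket", "job/");
        s3.getObjectMetadata("bucket", "job/task/");

        // Writing through S3A removes the "unnecessary" parent markers
        // once the file is committed.
        FSDataOutputStream out = fs.create(new Path("/job/task/file"));
        out.close();

        // After the write: the same lookup now fails with a 404.
        try {
            s3.getObjectMetadata("bucket", "job/");
        } catch (AmazonS3Exception e) {
            System.out.println("job/ marker removed: HTTP " + e.getStatusCode());
        }
    }
}
{code}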

My last counterpoint to the current Hadoop behavior with regard to S3A is the AWS S3 Console.
 It effectively models a filesystem despite the fact that it is backed by a blobstore.  I'm
able to create nested folders, upload a file, delete the file, and the nested "empty" folders
still exist.  As to the consistency guarantees, EMR solves those, making it behave even more
like a true FileSystem.

Thanks!

> S3AFileSystem silently deletes "fake" directories when writing a file.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-14124
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14124
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs, fs/s3
>    Affects Versions: 2.6.0
>            Reporter: Joel Baranick
>              Labels: filesystem, s3
>
> I realize that you guys probably have a good reason for {{S3AFileSystem}} to clean up
> "fake" folders when a file is written to S3.  That said, the fact that it silently does this
> feels like a separation-of-concerns issue.  It also leads to weird behavior where calls
> to {{AmazonS3Client.getObjectMetadata}} for folders work before calling {{S3AFileSystem.create}}
> but not after.  Also, there seems to be no mention in the javadoc that the {{deleteUnnecessaryFakeDirectories}}
> method is automatically invoked.  Lastly, it seems like the goal of {{FileSystem}} should be
> to ensure that code built on top of it is portable across different implementations.  This
> behavior is an example of a case where that can break down.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

