hadoop-common-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
Date Thu, 19 Sep 2013 17:34:57 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772074#comment-13772074
] 

Colin Patrick McCabe commented on HADOOP-9972:
----------------------------------------------

I guess I should talk about the motivation here.  Daryn Sharp, Kihwal Lee, Nathan Roberts,
Eli Collins, Andrew Wang, and I had a discussion about the new symlinks support in FileSystem
in Hadoop 2.  The Yahoo! guys were concerned that if listStatus started returning symlinks,
a lot of user code would break.  One example is code that assumes that if FileStatus#isFile
returns false, then the inode is a directory.  That assumption does not hold for symlinks.
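
The {{isFile}} assumption can be sketched in a few lines.  The {{Status}} class below is a hypothetical stand-in that mirrors only the {{isFile}}, {{isDirectory}}, and {{isSymlink}} predicates of {{FileStatus}}; it is not the Hadoop class, and the helper names are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in mirroring three FileStatus predicates; not the Hadoop class. */
final class Status {
    enum Kind { FILE, DIRECTORY, SYMLINK }

    final String path;
    final Kind kind;

    Status(String path, Kind kind) {
        this.path = path;
        this.kind = kind;
    }

    boolean isFile()      { return kind == Kind.FILE; }
    boolean isDirectory() { return kind == Kind.DIRECTORY; }
    boolean isSymlink()   { return kind == Kind.SYMLINK; }
}

final class IsFileAssumption {
    /** Legacy pattern: anything that is not a file is treated as a directory. */
    static String classifyLegacy(Status s) {
        return s.isFile() ? "file" : "directory";
    }

    /** Symlink-aware pattern: test for a symlink before falling back to file/directory. */
    static String classifyAware(Status s) {
        if (s.isSymlink()) return "symlink";
        if (s.isDirectory()) return "directory";
        return "file";
    }

    public static void main(String[] args) {
        List<Status> listing = Arrays.asList(
            new Status("/data/part-0", Status.Kind.FILE),
            new Status("/data/logs", Status.Kind.DIRECTORY),
            new Status("/data/latest", Status.Kind.SYMLINK));
        for (Status s : listing) {
            System.out.println(s.path + ": legacy=" + classifyLegacy(s)
                + ", aware=" + classifyAware(s));
        }
    }
}
```

In this sketch the legacy check reports the symlink entry as a directory, which is the breakage the discussion was worried about.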

To prevent this scenario, we want to change FileSystem#listStatus and FileSystem#globStatus
to resolve all symlinks, and then provide an extended API for users who don't want that auto-resolve
behavior.  That's what this discussion is about: what that extended API should look like.
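
As a rough illustration of what "resolve all symlinks" could mean for a listing, here is a toy in-memory sketch.  Everything in it ({{ToyNamespace}}, the {{LINK:}} encoding, the hop limit) is hypothetical and not a Hadoop API; dropping dangling links follows the issue description quoted further down in this message.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory namespace; all names here are hypothetical, not Hadoop APIs. */
final class ToyNamespace {
    private static final int MAX_HOPS = 32; // arbitrary guard against symlink loops

    // Path -> "FILE", "DIR", or "LINK:<target path>"
    private final Map<String, String> entries = new HashMap<>();

    void put(String path, String value) {
        entries.put(path, value);
    }

    /** Follows links until a file or directory is reached; null if dangling or looping. */
    String resolveKind(String path) {
        String current = path;
        for (int hop = 0; hop <= MAX_HOPS; hop++) {
            String value = entries.get(current);
            if (value == null) return null;               // target does not exist
            if (!value.startsWith("LINK:")) return value; // reached FILE or DIR
            current = value.substring("LINK:".length());
        }
        return null; // too many hops: treated like a dangling link
    }

    /** Resolving listing: each path is reported as FILE or DIR, never as a link. */
    List<String> listResolved(List<String> paths) {
        List<String> out = new ArrayList<>();
        for (String p : paths) {
            String kind = resolveKind(p);
            if (kind != null) {
                out.add(p + "=" + kind);
            }
        }
        return out;
    }
}
```

Callers of such a resolving listing only ever see files and directories, so code written before symlinks existed keeps working; a symlink-aware variant would return the link entries themselves instead.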

The discussion about whether HDFS should replace listStatus with something more like POSIX
readdir seems like a tangent.  That's an interesting thing to discuss, but it doesn't really
solve our problem in branch-2.1-beta, since there is still going to be code around that calls
listStatus and globStatus for a long, long time.

That said, I'm not even convinced that we should replace {{listStatus}} with {{readdir}}.
The reason why {{listStatus}} returns {{FileStatus[]}} rather than just a list of paths and
file types is to minimize the number of network round trips to the NameNode.  That is still
something we care about.  If you run {{/bin/ls}} under strace (for example, {{ls -l}}), you'll see that ls calls {{getdents}}
(the system call underlying readdir) and then makes an {{lstat}} call on each path name in the
directory.  If the HDFS shell did the same thing, it would dramatically increase the
number of RPCs it makes to the NameNode.
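
The round-trip argument can be illustrated with a toy request counter.  {{ToyNameNode}} below is a hypothetical model, not the HDFS NameNode protocol, and it assumes a whole directory fits in one listing response (a real listing of a very large directory may take more than one request).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy metadata server that counts requests; not the HDFS NameNode protocol. */
final class ToyNameNode {
    private final Map<String, Long> lengthsByName = new LinkedHashMap<>();
    int rpcCount = 0;

    void addEntry(String name, long length) {
        lengthsByName.put(name, length);
    }

    /** listStatus style: one request returns every entry together with its metadata. */
    Map<String, Long> listStatus() {
        rpcCount++;
        return new LinkedHashMap<>(lengthsByName);
    }

    /** readdir style: one request returns the names only. */
    List<String> readdir() {
        rpcCount++;
        return new ArrayList<>(lengthsByName.keySet());
    }

    /** lstat style: one request per name to fetch that entry's metadata. */
    long lstat(String name) {
        rpcCount++;
        return lengthsByName.get(name);
    }
}

final class RoundTripDemo {
    /** Requests needed when metadata arrives with the listing. */
    static int rpcsForListStatus(ToyNameNode nn) {
        nn.rpcCount = 0;
        nn.listStatus();
        return nn.rpcCount;
    }

    /** Requests needed when names are listed first and each one is stat'ed afterwards. */
    static int rpcsForReaddirPlusLstat(ToyNameNode nn) {
        nn.rpcCount = 0;
        for (String name : nn.readdir()) {
            nn.lstat(name);
        }
        return nn.rpcCount;
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.addEntry("a", 10L);
        nn.addEntry("b", 20L);
        nn.addEntry("c", 30L);
        System.out.println("listStatus requests: " + rpcsForListStatus(nn));
        System.out.println("readdir + lstat requests: " + rpcsForReaddirPlusLstat(nn));
    }
}
```

For a directory with N entries this model counts 1 request for the {{listStatus}} style and 1 + N for the {{readdir}} plus {{lstat}} style, so the three-entry example gives 1 versus 4.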

Also see Jason Lowe's comment here: https://issues.apache.org/jira/browse/HADOOP-9912?focusedCommentId=13772002&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13772002
                
> new APIs for listStatus and globStatus to deal with symlinks
> ------------------------------------------------------------
>
>                 Key: HADOOP-9972
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9972
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 2.1.1-beta
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>
> Based on the discussion in HADOOP-9912, we need new APIs for FileSystem to deal with
> symlinks.  The issue is that code has been written which is incompatible with the existence
> of things which are not files or directories.  For example,
> there is a lot of code out there that looks at FileStatus#isFile, and
> if it returns false, assumes that what it is looking at is a
> directory.  In the case of a symlink, this assumption is incorrect.
> It seems reasonable to make the default behavior of {{FileSystem#listStatus}} and {{FileSystem#globStatus}}
> be to fully resolve symlinks and ignore dangling ones.  This will prevent incompatibility
> with existing MR jobs and other HDFS users.  We should also add new versions of listStatus
> and globStatus that allow new, symlink-aware code to deal with symlinks as symlinks.
