hadoop-common-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
Date Tue, 17 Sep 2013 16:33:51 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769661#comment-13769661 ]

Colin Patrick McCabe commented on HADOOP-9972:
----------------------------------------------

I guess I should add a few words about why {{PathErrorHandler}} is necessary.  Basically,
we want to give users of {{globStatus}} flexibility in how errors on individual paths are
handled.

For example, let's say you have the following directories:
/a owned by superuser, mode 0000
/b owned by bob, mode 0777

Bob would like to be able to get back a result from {{globStatus(/\*/stuff)}}, not just an
AccessControlException (raised while trying to access /a/stuff).  But Bob doesn't necessarily
want to ignore the AccessControlException completely, either.  He wants something like the
behavior of GNU ls, which prints an error message to stderr about the paths it can't access
but still lists the remaining paths it can.  Currently, Bob can't get this: he simply gets
an IOException and *no* globStatus results.  Silently ignoring the error seems like the wrong
thing to do as well.  Hence the {{PathErrorHandler}}, which allows more sophisticated error
handling here.

Symlinks make this more important, since you have errors like {{UnresolvedPathException}},
which anyone can cause simply by creating a dangling symlink.  We don't want directories with
dangling symlinks to become un-globbable.  Obviously, the default error handlers will provide
the existing behavior for {{listStatus}} and {{globStatus}}.
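
To make the idea concrete, here is a minimal, self-contained sketch of the pattern described above. It does not use Hadoop at all, and it is not the API from the patch: the {{PathErrorHandler}} signature, the {{FAIL_FAST}} constant, and the {{list}} and {{globStuff}} helpers are all hypothetical names invented for illustration, and the "filesystem" is a hard-coded stand-in for the /a and /b example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only: none of these names come from the Hadoop code base.
class GlobErrorHandlingSketch {

    /** Hypothetical callback that decides what happens when one path fails. */
    interface PathErrorHandler {
        void handle(String path, IOException error) throws IOException;
    }

    /** Rethrows the error, giving all-or-nothing behavior. */
    static final PathErrorHandler FAIL_FAST = (path, error) -> { throw error; };

    /** Stand-in for a directory listing: /a is unreadable, /b contains "stuff". */
    static List<String> list(String dir) throws IOException {
        if (dir.equals("/a")) {
            throw new IOException("Permission denied: " + dir);
        }
        if (dir.equals("/b")) {
            return Arrays.asList("stuff");
        }
        return Collections.emptyList();
    }

    /** Expands the pattern /<dir>/stuff over the given top-level directories. */
    static List<String> globStuff(List<String> topLevelDirs, PathErrorHandler handler)
            throws IOException {
        List<String> matches = new ArrayList<>();
        for (String dir : topLevelDirs) {
            try {
                if (list(dir).contains("stuff")) {
                    matches.add(dir + "/stuff");
                }
            } catch (IOException e) {
                // The handler may log and return (keep going) or rethrow (abort).
                handler.handle(dir, e);
            }
        }
        return matches;
    }
}
```

With {{FAIL_FAST}}, calling {{globStuff}} on /a and /b throws and returns nothing.  With a handler that prints to stderr and returns, the same call reports /a and still returns /b/stuff, which is the GNU ls style of behavior.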
                
> new APIs for listStatus and globStatus to deal with symlinks
> ------------------------------------------------------------
>
>                 Key: HADOOP-9972
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9972
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 2.1.1-beta
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>
> Based on the discussion in HADOOP-9912, we need new APIs for FileSystem to deal with
> symlinks.  The issue is that existing code is incompatible with the existence
> of things which are neither files nor directories.  For example,
> there is a lot of code out there that looks at FileStatus#isFile, and
> if it returns false, assumes that what it is looking at is a
> directory.  In the case of a symlink, this assumption is incorrect.
> It seems reasonable to make the default behavior of {{FileSystem#listStatus}} and
> {{FileSystem#globStatus}} fully resolve symlinks and ignore dangling ones.  This will
> prevent incompatibility with existing MR jobs and other HDFS users.  We should also add
> new versions of listStatus and globStatus that allow new, symlink-aware code to deal
> with symlinks as symlinks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
