hadoop-common-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9972) new APIs for listStatus and globStatus to deal with symlinks
Date Thu, 19 Sep 2013 01:22:52 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771492#comment-13771492 ]

Colin Patrick McCabe commented on HADOOP-9972:

bq. I did some experiments, you can see ls * indeed should error message, but ls */stuff should
not show error message.

I'm afraid that what you're seeing is a bug.  I introduced this bug and I have a patch available
to fix it: https://issues.apache.org/jira/browse/HADOOP-9929

This bug is also not in branch-2.1-beta, so if you'd like to see what the current correct
behavior of globStatus is, try that branch.  You can also try branch-1.

bq. [listLinkStatus proposal]

I want to avoid a combinatorial explosion of function overloads.

Right now we have {{FileSystem#listStatus(Path)}}, {{FileSystem#listStatus(Path, PathFilter)}},
{{FileSystem#listStatus(Path[])}}, and {{FileSystem#listStatus(Path[], PathFilter)}}.
If we create {{listLinkStatus}} as you proposed, that doubles the number of listing functions
in FileSystem, since we have to create a {{listLinkStatus}} equivalent for each of these.

It's much cleaner to fold the {{PathFilter}} into a {{PathOptions}} class, I think.  That
only requires adding two new functions to FileSystem:  {{FileSystem#listStatus(Path, PathOptions)}}
and {{FileSystem#listStatus(Path[], PathOptions)}}.
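
To make the overload argument concrete, here is a minimal, self-contained sketch of the idea. Everything in it is hypothetical: {{PathOptions}} is only a proposal in this thread, and {{Entry}}, {{ToyFileSystem}}, and the setter names are stand-ins invented for illustration, not Hadoop classes. A {{Predicate<String>}} plays the role of {{PathFilter}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical directory entry; linkTarget is null unless the entry is a symlink.
final class Entry {
    final String name;
    final String linkTarget;
    Entry(String name, String linkTarget) { this.name = name; this.linkTarget = linkTarget; }
}

// Hypothetical options object: the filter and the symlink policy travel together,
// so a new option does not require a new overload.
final class PathOptions {
    private Predicate<String> filter = name -> true; // stands in for PathFilter
    private boolean resolveSymlinks = true;          // assumed default: resolve links

    PathOptions setFilter(Predicate<String> filter) { this.filter = filter; return this; }
    PathOptions setResolveSymlinks(boolean resolve) { this.resolveSymlinks = resolve; return this; }
    boolean accepts(String name) { return filter.test(name); }
    boolean resolveSymlinks() { return resolveSymlinks; }
}

// Toy listing function: one method covers every filter/symlink combination.
final class ToyFileSystem {
    static List<String> listStatus(List<Entry> dir, PathOptions options) {
        List<String> out = new ArrayList<>();
        for (Entry e : dir) {
            if (!options.accepts(e.name)) continue;      // filter on the entry name
            if (e.linkTarget == null) out.add(e.name);   // plain file or directory
            else if (options.resolveSymlinks()) out.add(e.linkTarget);
            else out.add(e.name + " -> " + e.linkTarget); // report the link itself
        }
        return out;
    }
}
```

With this shape, a caller that wants links reported as links would write something like {{ToyFileSystem.listStatus(dir, new PathOptions().setResolveSymlinks(false))}}, and existing callers keep the resolving default.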

With regard to {{globStatus}}, you can't build what we want on top of what we have now.  The
first IOException we hit will cause the globStatus function to abort.  Clients like the shell,
which want to handle errors differently, simply don't get a chance to do so with the current

bq. Separate globStatus to glob and getFileStatus seems a more proper way of doing globStatus
rather than add new classes/interface and callback handler, and this is linux practice, should
be more robust

The Linux practice is based on the fact that {{readdir}} only returns path names (i.e. strings)
in POSIX.  In HDFS and other Hadoop filesystems, we don't have {{readdir}}, only {{listStatus}},
which returns an array of {{FileStatus}} objects.

Since we're already dealing with {{FileStatus}} objects, it makes no sense to call {{getFileStatus}}
on them again; it's a pure waste of computer time.  You also need some way of handling errors
encountered in globStatus besides ignoring them or aborting the whole glob.  See HADOOP-9929
for more commentary on this issue.
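
As a rough illustration of the "neither ignore nor abort" point, the sketch below lets the caller decide what happens when a single path fails during a glob. It is an assumption-laden toy, not the design from HADOOP-9929: {{GlobErrorHandler}}, {{StatusSource}}, and {{GlobSketch}} are hypothetical names, and strings stand in for {{FileStatus}} objects.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback: return normally to skip the path and continue,
// or throw to abort the whole glob.
interface GlobErrorHandler {
    void handle(String path, IOException error) throws IOException;
}

// Stand-in for a per-path lookup that may fail (for example, permission denied).
interface StatusSource {
    String statusOf(String path) throws IOException;
}

final class GlobSketch {
    static List<String> glob(List<String> candidates, StatusSource source,
                             GlobErrorHandler handler) throws IOException {
        List<String> results = new ArrayList<>();
        for (String path : candidates) {
            try {
                results.add(source.statusOf(path));
            } catch (IOException e) {
                handler.handle(path, e); // the caller chooses: skip or abort
            }
        }
        return results;
    }
}
```

A shell-like client could pass a handler that records the message and carries on, while a stricter client could pass {{(path, e) -> { throw e; }}} to keep the abort-on-first-error behavior.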
> new APIs for listStatus and globStatus to deal with symlinks
> ------------------------------------------------------------
>                 Key: HADOOP-9972
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9972
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 2.1.1-beta
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
> Based on the discussion in HADOOP-9912, we need new APIs for FileSystem to deal with
> symlinks.  The issue is that existing code was written on the assumption that every
> path is either a file or a directory.  For example,
> there is a lot of code out there that looks at FileStatus#isFile, and
> if it returns false, assumes that what it is looking at is a
> directory.  In the case of a symlink, this assumption is incorrect.
> It seems reasonable to make the default behavior of {{FileSystem#listStatus}} and {{FileSystem#globStatus}}
> fully resolve symlinks and ignore dangling ones.  This will prevent incompatibility
> with existing MR jobs and other HDFS users.  We should also add new versions of listStatus
> and globStatus that allow new, symlink-aware code to deal with symlinks as symlinks.
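
The incompatibility described in the issue can be shown with a small sketch. The {{Status}} class below is a minimal stand-in that mirrors the {{isFile}}, {{isDirectory}}, and {{isSymlink}} predicates on Hadoop's {{FileStatus}}; {{Classify}} and its method names are hypothetical and exist only for illustration.

```java
// Minimal stand-in for FileStatus: exactly one of the three predicates is true.
final class Status {
    enum Kind { FILE, DIRECTORY, SYMLINK }
    private final Kind kind;
    Status(Kind kind) { this.kind = kind; }
    boolean isFile() { return kind == Kind.FILE; }
    boolean isDirectory() { return kind == Kind.DIRECTORY; }
    boolean isSymlink() { return kind == Kind.SYMLINK; }
}

final class Classify {
    // Legacy pattern: anything that is not a file is assumed to be a directory.
    static String legacy(Status s) {
        return s.isFile() ? "file" : "directory";
    }

    // Symlink-aware pattern: test each kind explicitly.
    static String symlinkAware(Status s) {
        if (s.isSymlink()) return "symlink";
        if (s.isDirectory()) return "directory";
        return "file";
    }
}
```

Handed an unresolved symlink, {{legacy}} reports a directory, which is why resolving links by default keeps such callers working while new code opts in to seeing links as links.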
