hadoop-common-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9984) FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by default
Date Thu, 26 Sep 2013 22:28:04 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13779369#comment-13779369 ]

Colin Patrick McCabe commented on HADOOP-9984:

This patch changes {{listStatus}} and {{globStatus}} to resolve symlinks.

If a symlink can't be resolved when doing a {{listStatus}}, a {{DirectoryContentsResolutionException}}
is thrown which contains the resolution exception.  This will usually be {{FileNotFoundException}},
but it doesn't have to be.  It could also be some other error that occurred when trying to
do the RPC.  The globber ignores missing files, just as it does now.  The implementation also
makes this necessary, since the globber catches and discards {{FileNotFoundException}}, and
dangling symlinks always manifest as {{FileNotFoundException}}.
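
A minimal sketch of the two behaviors described above, using a toy in-memory directory rather than Hadoop's {{FileSystem}}.  The nested {{DirectoryContentsResolutionException}} is a hypothetical stand-in (the real class in the patch may have a different constructor and fields), and {{DIR}}, {{resolve}} and the regex-based {{glob}} are illustrative names only.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ListStatusSketch {
    /** Hypothetical stand-in for the exception named above; the real class may differ. */
    public static class DirectoryContentsResolutionException extends IOException {
        public DirectoryContentsResolutionException(String entry, IOException cause) {
            super("cannot resolve " + entry, cause);
        }
    }

    /** Toy directory: entry name -> symlink target, or null for a regular file. */
    static final Map<String, String> DIR = new LinkedHashMap<>();
    static {
        DIR.put("a.txt", null);
        DIR.put("good-link", "a.txt");
        DIR.put("dangling", "missing.txt");
    }

    /** Follows links until a regular file is reached; a missing target is a FileNotFoundException. */
    static String resolve(String name) throws IOException {
        if (!DIR.containsKey(name)) {
            throw new FileNotFoundException(name);
        }
        String target = DIR.get(name);
        return target == null ? name : resolve(target);
    }

    /** Resolving list: any resolution failure is wrapped, with the original error as the cause. */
    public static List<String> listStatus() throws IOException {
        List<String> resolvable = new ArrayList<>();
        for (String name : DIR.keySet()) {
            try {
                resolve(name);
                resolvable.add(name);
            } catch (IOException e) {
                throw new DirectoryContentsResolutionException(name, e);
            }
        }
        return resolvable;
    }

    /** Glob-style listing: matches on the entry name and silently skips dangling links. */
    public static List<String> glob(String regex) throws IOException {
        List<String> matches = new ArrayList<>();
        for (String name : DIR.keySet()) {
            if (!name.matches(regex)) {
                continue;
            }
            try {
                resolve(name);
                matches.add(name);
            } catch (FileNotFoundException e) {
                // Dangling link: ignored, as the globber ignores missing files.
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(glob(".*"));
        try {
            listStatus();
        } catch (DirectoryContentsResolutionException e) {
            System.out.println(e.getMessage() + ", cause: " + e.getCause());
        }
    }
}
```

In this toy model the glob call returns the two resolvable entries, while the list call fails on the dangling entry and exposes the underlying {{FileNotFoundException}} through {{getCause()}}.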

I added a new API, {{listLinkStatus}}, which is like {{listStatus}}, but does not resolve
symlinks.  {{listLinkStatus}} is necessary here, since {{globStatus}} needs to glob on file
name, not target name (and this patch changes {{listStatus}} to resolve links, as previously
mentioned.)  Filesystems which don't (yet) support symlinks map {{listLinkStatus}} to {{listStatus}},
similarly to how we handle {{getFileLinkStatus}}.
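
The fallback described above can be sketched roughly as follows.  {{ToyFileSystem}} and its subclasses are hypothetical and are not Hadoop's {{FileSystem}}; entries are plain strings so that the example stays self-contained, and how the real patch reports the path of a resolved entry is not modeled here.

```java
import java.util.Arrays;
import java.util.List;

public class ListLinkStatusSketch {
    /** Toy base class: listLinkStatus defaults to listStatus. */
    public abstract static class ToyFileSystem {
        public abstract List<String> listStatus(String dir);

        /** Non-resolving variant; filesystems without symlinks inherit this default. */
        public List<String> listLinkStatus(String dir) {
            return listStatus(dir);
        }
    }

    /** A filesystem with no symlink support only implements listStatus. */
    public static class NoSymlinkFs extends ToyFileSystem {
        @Override
        public List<String> listStatus(String dir) {
            return Arrays.asList("data.txt: file", "sub: directory");
        }
    }

    /** A symlink-aware filesystem overrides both methods. */
    public static class SymlinkFs extends ToyFileSystem {
        @Override
        public List<String> listStatus(String dir) {
            // Resolved view: the link reports the type of its target.
            return Arrays.asList("data.txt: file", "latest: file");
        }

        @Override
        public List<String> listLinkStatus(String dir) {
            // Unresolved view: the link itself is reported.
            return Arrays.asList("data.txt: file", "latest: symlink");
        }
    }

    public static void main(String[] args) {
        System.out.println(new NoSymlinkFs().listLinkStatus("/d"));
        System.out.println(new SymlinkFs().listStatus("/d"));
        System.out.println(new SymlinkFs().listLinkStatus("/d"));
    }
}
```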

In Globber, I combined {{authorityFromPath}} and {{schemeFromPath}} into a single function,
{{uriToSchemeAndAuthority}}.  This was necessary since in cases where we accept the scheme of
the passed-in path, we also should accept its authority.  So, for example, when processing
{{file:///tmp/*}}, we want the scheme to show up as "file" and the authority to be null. 
Previously, we were getting the scheme as file, but the authority as the default authority,
something like "{{username@host}}".
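
To illustrate the rule, here is a small sketch built on {{java.net.URI}}.  It is not the code from the patch: the method name {{schemeAndAuthority}}, its parameters and the default values passed in {{main}} are assumptions made for the example.  The point is only that scheme and authority are taken from the same place, either both from the path or both from the defaults.

```java
import java.net.URI;
import java.util.Arrays;

public class SchemeAuthoritySketch {
    /**
     * Returns {scheme, authority}. If the path carries a scheme, its authority is
     * accepted as well (even when it is null); otherwise both defaults are used.
     */
    public static String[] schemeAndAuthority(URI path, String defaultScheme,
                                              String defaultAuthority) {
        if (path.getScheme() != null) {
            return new String[] { path.getScheme(), path.getAuthority() };
        }
        return new String[] { defaultScheme, defaultAuthority };
    }

    public static void main(String[] args) {
        // Hypothetical defaults, standing in for the default filesystem's values.
        String defaultScheme = "hdfs";
        String defaultAuthority = "username@host";

        // Scheme present, authority empty: "file" with a null authority.
        System.out.println(Arrays.toString(schemeAndAuthority(
            URI.create("file:///tmp/*"), defaultScheme, defaultAuthority)));

        // No scheme at all: both values fall back to the defaults.
        System.out.println(Arrays.toString(schemeAndAuthority(
            URI.create("/tmp/*"), defaultScheme, defaultAuthority)));
    }
}
```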

I fixed all the symlink-related unit tests in {{TestGlobPaths}} and added some more.  I added
a test of listStatus' behavior with dangling links to {{SymlinkBaseTest}}.

Path filters currently match on resolved path, both in {{globStatus}} and {{listStatus}}.
 The rationale is:
* When a filesystem goes from not supporting symlinks to supporting symlinks, we don't want
existing code to break.  If we always apply the path filter on resolved path, the behavior
visible to code will be the same whether or not the filesystem is aware of symlinks.
* Globbing on resolved path will make possible certain optimizations in the globber when {{resolveLinks=true}}.
* It seems more intuitive to filter on the path which you're actually returning.
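
As a rough illustration of filtering on the resolved path, the sketch below uses a toy directory map and {{java.util.function.Predicate}} in place of Hadoop's {{PathFilter}}.  The class, the {{list}} method and the sample paths are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class ResolvedFilterSketch {
    /**
     * Toy listing. dir maps an entry path to its symlink target, or to null for a
     * regular file. The filter sees the resolved path, which is also what is returned.
     */
    public static List<String> list(Map<String, String> dir, Predicate<String> filter) {
        List<String> accepted = new ArrayList<>();
        for (Map.Entry<String, String> entry : dir.entrySet()) {
            String resolved = entry.getValue() == null ? entry.getKey() : entry.getValue();
            if (filter.test(resolved)) {
                accepted.add(resolved);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        Map<String, String> dir = new LinkedHashMap<>();
        dir.put("/logs/app.log", null);
        dir.put("/logs/current", "/archive/app-2013.log"); // symlink
        dir.put("/logs/notes.txt", null);

        // The link name "current" has no .log suffix, but its resolved path does.
        System.out.println(list(dir, p -> p.endsWith(".log")));
    }
}
```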
> FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by default
> ----------------------------------------------------------------------------------
>                 Key: HADOOP-9984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9984
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 2.1.0-beta
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Blocker
>         Attachments: HADOOP-9984.001.patch, HADOOP-9984.003.patch, HADOOP-9984.005.patch
> During the process of adding symlink support to FileSystem, we realized that many existing
HDFS clients would be broken by listStatus and globStatus returning symlinks.  One example
is applications that assume that !FileStatus#isFile implies that the inode is a directory.
 As we discussed in HADOOP-9972 and HADOOP-9912, we should default these APIs to returning
resolved paths.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
