hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory
Date Fri, 06 Sep 2013 14:57:53 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760241#comment-13760241 ]

Jason Lowe commented on HADOOP-9912:
------------------------------------

Thanks for the behavior matrix, Colin.  I think the question of compatibility comes down to the
*expectations* callers have of the FileSystem listStatus API.  FileSystem didn't support symlinks
until very recently, and as a result I doubt many, if any, symlinks were being used in HDFS.
Manipulating them required custom Java code, and nothing written against FileSystem would work
with them.

I am under the impression that we want symlinks to "just work" for the majority of existing
applications.  If that's the case, then we need to avoid exposing raw symlinks as results from
the existing FileSystem APIs, since callers aren't expecting to deal with them.  A directory walker
is the classic case: it expects isDir() to tell it when to traverse subdirectories, and a
symlink to a directory breaks that assumption.
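
To make the directory-walker case concrete, here is a minimal sketch using a toy in-memory tree
rather than the real Hadoop FileSystem API.  Every name in it (WalkerSketch, Node, walk, the
followLinks flag) is hypothetical and exists only for illustration.  It follows a single symlink
hop and does not guard against cycles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy in-memory tree; NOT the Hadoop FileSystem API. All names are hypothetical.
class WalkerSketch {
    enum Kind { FILE, DIR, SYMLINK }

    static final class Node {
        final String name;
        final Kind kind;
        final Node target;          // only set for symlinks; null models a dangling link
        final List<Node> children;  // only populated for directories

        Node(String name, Kind kind, Node target, Node... children) {
            this.name = name;
            this.kind = kind;
            this.target = target;
            this.children = Arrays.asList(children);
        }
    }

    static Node file(String name) { return new Node(name, Kind.FILE, null); }
    static Node dir(String name, Node... children) { return new Node(name, Kind.DIR, null, children); }
    static Node link(String name, Node target) { return new Node(name, Kind.SYMLINK, target); }

    // A typical walker: recurse when the listed status says "directory",
    // otherwise treat the entry as a leaf. With followLinks == false the
    // listing hands back the raw symlink, so the walker never descends into it.
    static List<String> walk(String path, Node dir, boolean followLinks) {
        List<String> leaves = new ArrayList<>();
        for (Node child : dir.children) {
            String childPath = path + "/" + child.name;
            Node status = child;
            if (followLinks && child.kind == Kind.SYMLINK && child.target != null) {
                status = child.target;  // single hop only; no cycle detection
            }
            if (status.kind == Kind.DIR) {  // the isDir() check the walker relies on
                leaves.addAll(walk(childPath, status, followLinks));
            } else {
                leaves.add(childPath);
            }
        }
        return leaves;
    }

    // /data contains a.txt and a symlink "link" pointing at a directory holding b.txt.
    static Node sampleTree() {
        Node other = dir("other", file("b.txt"));
        return dir("data", file("a.txt"), link("link", other));
    }

    public static void main(String[] args) {
        Node data = sampleTree();
        System.out.println(walk("/data", data, false)); // raw symlinks: b.txt is never visited
        System.out.println(walk("/data", data, true));  // resolved: b.txt is found under the link
    }
}
```

With raw symlinks the walker reports /data/link as a leaf and misses b.txt entirely.  With
resolution it descends and reports /data/link/b.txt, which is what an unmodified walker expects.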

A proposal to keep the existing FileSystem users working with symlinks in HDFS:

- listStatus resolves symlinks when possible.  If the symlink cannot be resolved (e.g., dangling
link, permission-restricted target path), it returns the status of the symlink itself, since it
cannot stat the symlink target.
- A separate API, either an overload of listStatus with an extra flag to control symlink resolution
or a separate listLinkStatus, can be used by callers that always want the symlink status
and not the status of the symlink target.  I would not expect the majority of existing listStatus
callers to want to see symlinks and have to resolve them.  This is akin to the
getFileStatus/getFileLinkStatus pairing: existing callers of getFileStatus never expected symlinks,
so it always follows them, and a new API was added to examine the symlink itself, rather than
adding a new status API that always follows the symlink.
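
The two bullets above can be sketched roughly as follows.  This is a toy in-memory model, not
Hadoop code: the class ListStatusSketch, its Status type, the helper methods, and the 8-hop limit
are all hypothetical, and listLinkStatus is only the name floated in this proposal.  The sketch
also assumes a resolved entry keeps the link's own path while reporting the target's type.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the proposed semantics; NOT the Hadoop FileSystem API.
class ListStatusSketch {
    enum Kind { FILE, DIR, SYMLINK }

    static final class Status {
        final String path;
        final Kind kind;
        Status(String path, Kind kind) { this.path = path; this.kind = kind; }
        boolean isDir() { return kind == Kind.DIR; }
        boolean isSymlink() { return kind == Kind.SYMLINK; }
    }

    private final Map<String, Kind> kinds = new LinkedHashMap<>();  // path -> entry type
    private final Map<String, String> linkTargets = new HashMap<>(); // link path -> target path

    void addFile(String path) { kinds.put(path, Kind.FILE); }
    void addDir(String path) { kinds.put(path, Kind.DIR); }
    void addLink(String path, String target) {
        kinds.put(path, Kind.SYMLINK);
        linkTargets.put(path, target);
    }

    // Second bullet: always report the symlink itself, never its target.
    List<Status> listLinkStatus(String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        List<Status> out = new ArrayList<>();
        for (Map.Entry<String, Kind> e : kinds.entrySet()) {
            String p = e.getKey();
            boolean directChild = p.startsWith(prefix) && p.indexOf('/', prefix.length()) < 0;
            if (directChild) {
                out.add(new Status(p, e.getValue()));
            }
        }
        return out;
    }

    // First bullet: resolve symlinks when possible, else fall back to the link's status.
    List<Status> listStatus(String dir) {
        List<Status> out = new ArrayList<>();
        for (Status s : listLinkStatus(dir)) {
            out.add(s.isSymlink() ? resolveOrSelf(s) : s);
        }
        return out;
    }

    private Status resolveOrSelf(Status link) {
        String target = linkTargets.get(link.path);
        // Arbitrary hop limit so a symlink cycle cannot loop forever.
        for (int hops = 0; hops < 8 && target != null; hops++) {
            Kind k = kinds.get(target);
            if (k == null) {
                break;  // dangling: the target does not exist
            }
            if (k != Kind.SYMLINK) {
                return new Status(link.path, k);  // link's path, target's type
            }
            target = linkTargets.get(target);
        }
        return link;  // unresolvable: report the symlink itself
    }

    public static void main(String[] args) {
        ListStatusSketch fs = new ListStatusSketch();
        fs.addDir("/data");
        fs.addDir("/other");
        fs.addLink("/data/good", "/other");
        fs.addLink("/data/dangling", "/missing");
        for (Status s : fs.listStatus("/data")) {
            System.out.println(s.path + " isDir=" + s.isDir() + " isSymlink=" + s.isSymlink());
        }
    }
}
```

In this model an existing caller of listStatus sees /data/good as a directory and only ever
encounters a symlink status for the dangling link, while a symlink-aware caller can use
listLinkStatus to see both links as links.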

For me it's all about what callers are expecting FileSystem's listStatus semantics to be.
I believe that existing callers are *not* expecting symlinks to be returned, since FileSystem
never supported them in the past and I doubt they were being used in HDFS in general.  Most
callers expect listStatus to behave like a readdir followed by a stat of each entry, and stat
follows symlinks.  If listStatus does not resolve symlinks, then it breaks existing Pig and
MapReduce code, and I believe that's an indication it will break a lot more code out there.  The
code that breaks can be updated to understand symlinks, but I believe in practice that means
symlinks to directories will be fragile for a long time.  Each tool that encounters them will
have to be updated to check for them and behave accordingly.
                
> globStatus of a symlink to a directory does not report symlink as a directory
> -----------------------------------------------------------------------------
>
>                 Key: HADOOP-9912
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9912
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Priority: Blocker
>         Attachments: HADOOP-9912-testcase.patch, new-hdfs.txt, new-local.txt, old-hdfs.txt, old-local.txt
>
>
> globStatus for a path that is a symlink to a directory used to report the resulting FileStatus
> as a directory, but recently this has changed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
