Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Date: Thu, 16 Jun 2016 16:38:05 +0000 (UTC)
From: "Chris Nauroth (JIRA)" <jira@apache.org>
To: common-issues@hadoop.apache.org
Message-ID: <JIRA.12979535.1466026276000.8530.1466095085288@Atlassian.JIRA>
In-Reply-To: <JIRA.12979535.1466026276000@Atlassian.JIRA>
References: <JIRA.12979535.1466026276000@Atlassian.JIRA> <JIRA.12979535.1466026276597@arcas>
Subject: [jira] [Commented] (HADOOP-13278) S3AFileSystem mkdirs does not
 need to validate parent path components
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Thu, 16 Jun 2016 16:38:07 -0000


    [ https://issues.apache.org/jira/browse/HADOOP-13278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334120#comment-15334120 ]

Chris Nauroth commented on HADOOP-13278:
----------------------------------------

[~apetresc], I admit I hadn't considered interaction with IAM policies before, but I definitely see how this could be useful, and it's interesting to think about it.  Unfortunately, I don't see a viable way to satisfy the full range of possible authorization requirements that users have come to expect from a file system.

For the specific case that we started talking about here (walking up the ancestry to verify that there are no pre-existing files), it might work if that policy were changed slightly, so that the user was granted full access to /a/b/c/* and also granted read-only access to /*.  I expect read access would be sufficient for the ancestry-checking logic.  Of course, if you also want to block read access to /, then this policy wouldn't satisfy the requirement.  It would only block write access on /.
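
To make the policy idea above concrete, here is a minimal Python sketch of a toy model.  It is not real IAM evaluation and not S3A code: the policy representation, the action names ("read", "write", "delete"), and the helper names are all hypothetical.  It only illustrates why a walk up the ancestry is denied under a subtree-only grant and passes once read-only access is added everywhere.

```python
from fnmatch import fnmatchcase

# Toy policy: a list of (allowed actions, key pattern) grants.
# Hypothetical model for illustration only; real IAM evaluation is richer.
SUBTREE_ONLY = [({"read", "write", "delete"}, "a/b/c/*")]
SUBTREE_PLUS_READ = SUBTREE_ONLY + [({"read"}, "*")]


def allowed(policy, action, key):
    """Return True if any grant in the toy policy permits the action on the key."""
    return any(action in actions and fnmatchcase(key, pattern)
               for actions, pattern in policy)


def ancestors(key):
    """Yield the parents of 'a/b/c/d' from nearest to farthest: a/b/c, a/b, a."""
    parts = key.strip("/").split("/")
    for i in range(len(parts) - 1, 0, -1):
        yield "/".join(parts[:i])


def first_denied_ancestor(policy, key):
    """Return the first ancestor whose metadata read is denied, or None."""
    for parent in ancestors(key):
        if not allowed(policy, "read", parent):
            return parent
    return None


if __name__ == "__main__":
    print(first_denied_ancestor(SUBTREE_ONLY, "a/b/c/d/e"))
    print(first_denied_ancestor(SUBTREE_PLUS_READ, "a/b/c/d/e"))
```

In this toy model the subtree-only grant already denies the read on a/b/c itself, because the pattern a/b/c/* covers only keys below it.  Adding the read-only grant lets the whole walk pass while writes outside the subtree remain denied.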

Another consideration is handling of what we call a "fake directory", which is a pure metadata object used to indicate the presence of an empty directory.  For example, consider an administrator allocating a bucket, bootstrapping the initial /a/b/c directory structure by running mkdir, and then applying the policy I described above.  At this point, S3A has persisted /a/b/c to the bucket as a fake directory.  After the first file put, say /a/b/c/d, S3A no longer needs that pure metadata object to indicate the presence of the directory.  Instead, the directory exists implicitly via the existence of the file /a/b/c/d.  At that point, S3A would clean up the fake directory by deleting /a/b/c.  That implies the user would need to be granted delete access to /a/b/c itself, not just /a/b/c/*.  Now if we further consider the user deleting /a/b/c/d after that, then S3A needs to recreate the fake directory at /a/b/c, so the user is going to need put access on /a/b/c.
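
The lifecycle described above can be sketched with a small in-memory model in Python.  This is an illustration under stated assumptions, not the S3AFileSystem implementation: the ToyStore class, its method names, and the choice to key the marker by the directory path itself are all hypothetical simplifications.  The point is only to show which mutating requests land on /a/b/c itself.

```python
class ToyStore:
    """In-memory stand-in for a bucket that logs every mutating request.

    Hypothetical model: objects maps a path to "file" or "marker",
    where "marker" plays the role of the fake directory.
    """

    def __init__(self):
        self.objects = {}
        self.requests = []  # (verb, path) pairs, in the order issued

    def _put(self, path, kind):
        self.objects[path] = kind
        self.requests.append(("PUT", path))

    def _delete(self, path):
        self.objects.pop(path, None)
        self.requests.append(("DELETE", path))

    def mkdir(self, path):
        # An empty directory is represented by a pure metadata object.
        self._put(path, "marker")

    def put_file(self, path):
        self._put(path, "file")
        parent = path.rsplit("/", 1)[0]
        # The file now implies the directory, so the marker is cleaned up.
        if self.objects.get(parent) == "marker":
            self._delete(parent)

    def delete_file(self, path):
        self._delete(path)
        parent = path.rsplit("/", 1)[0]
        # If nothing is left under the parent, recreate its marker.
        if not any(key.startswith(parent + "/") for key in self.objects):
            self._put(parent, "marker")


if __name__ == "__main__":
    store = ToyStore()
    store.mkdir("/a/b/c")          # administrator bootstraps the directory
    store.put_file("/a/b/c/d")     # user writes the first file
    store.delete_file("/a/b/c/d")  # user removes it again
    for verb, path in store.requests:
        print(verb, path)
```

In this model, the user's two operations issue a DELETE and then a PUT against /a/b/c itself, which is why a grant limited to /a/b/c/* would not be enough.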

bq. Is this correct? If so, I'm not sure a separate issue is needed; the use case would simply be unsupported and I'll have to move my S3A filesystem to a bucket that grants Hadoop/Spark root access.

Definitely the typical usage is to dedicate the whole bucket to persistence of a single S3A file system, with the understanding of the authorization limitations that come with that.  Anyone who has credentials to access the bucket effectively has full access to that whole file system.  This is a known limitation, and it's common to other object store file systems like WASB too.  I'm not aware of anyone trying to use IAM policies to restrict access to a sub-tree.  Certainly it's not something we actively test within the project right now, so in that sense, it's unsupported and you'd be treading new ground.

> S3AFileSystem mkdirs does not need to validate parent path components
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-13278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13278
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3, tools
>            Reporter: Adrian Petrescu
>            Priority: Minor
>
> According to S3 semantics, there is no conflict if a bucket contains a key named {{a/b}} and also a directory named {{a/b/c}}. "Directories" in S3 are, after all, nothing but prefixes.
> However, the {{mkdirs}} call in {{S3AFileSystem}} does go out of its way to traverse every parent path component for the directory it's trying to create, making sure there's no file with that name. This is suboptimal for three main reasons:
>  * Wasted API calls, since the client is getting metadata for each path component
>  * This can cause *major* problems with buckets whose permissions are being managed by IAM, where access may not be granted to the root bucket, but only to some prefix. When you call {{mkdirs}}, even on a prefix that you have access to, the traversal up the path will cause you to eventually hit the root bucket, which will fail with a 403 - even though the directory creation call would have succeeded.
>  * Some people might actually have a file that matches some other file's prefix... I can't see why they would want to do that, but it's not against S3's rules.
> I've opened a pull request with a simple patch that just removes this portion of the check. I have tested it with my team's instance of Spark + Luigi, and can confirm it works, and resolves the aforementioned permissions issue for a bucket on which we only had prefix access.
> This is my first ticket/pull request against Hadoop, so let me know if I'm not following some convention properly :)

