Subject: Re: File Permissions on s3 FileSystem
From: Parth Savani
To: user@hadoop.apache.org
Date: Thu, 25 Oct 2012 15:52:51 -0400

Hello Harsh,
I am following the steps from this link: http://wiki.apache.org/hadoop/AmazonS3

When I run the job, I can see that Hadoop places all the jars required for the job on S3. However, when it tries to run the job, it complains:
The ownership on the staging directory s3n://KEY:VALUE@bucket/tmp/ec2-user/.staging is not as expected. It is owned by . The directory must be owned by the submitter ec2-user or by ec2-user
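As a quick check on what is going on, something like the following sketch should show what owner s3n actually reports for that staging path (the bucket name and credentials are placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckStagingOwner {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.s3n.awsAccessKeyId", "KEY");        // placeholder credentials
    conf.set("fs.s3n.awsSecretAccessKey", "SECRET"); // placeholder credentials

    // Placeholder bucket/path, matching the staging dir from the error message.
    Path staging = new Path("s3n://bucket/tmp/ec2-user/.staging");
    FileSystem fs = FileSystem.get(URI.create("s3n://bucket/"), conf);

    if (fs.exists(staging)) {
      FileStatus status = fs.getFileStatus(staging);
      // Print whatever owner and permission the s3n filesystem reports.
      System.out.println("owner = '" + status.getOwner() + "'");
      System.out.println("permission = " + status.getPermission());
    } else {
      System.out.println(staging + " does not exist yet");
    }
  }
}

I would expect this to print an empty owner string, which matches the blank value in the error above.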

Some people seem to have solved this permissions problem here -> https://issues.apache.org/jira/browse/HDFS-1333
But they made changes to some Hadoop Java classes, and I wonder if there's an easier workaround.
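My guess is that the change amounts to relaxing the ownership check when the filesystem cannot report an owner at all; a self-contained sketch of that idea (only my reading of the approach, not the actual patch from that JIRA) would be something like:

import java.io.IOException;

public class RelaxedOwnerCheck {
  // Hypothetical relaxed version of the ownership check from
  // JobSubmissionFiles.getStagingDir (NOT the actual HDFS-1333 change):
  // an empty owner, as s3n reports, is treated as "unknown" rather than a mismatch.
  static void checkOwner(String owner, String currentUser, String realUser)
      throws IOException {
    if (owner == null || owner.isEmpty()) {
      return; // the filesystem cannot report ownership, so there is nothing to verify
    }
    if (!(owner.equals(currentUser) || owner.equals(realUser))) {
      throw new IOException("The staging directory is owned by " + owner
          + "; it must be owned by the submitter " + currentUser + " or by " + realUser);
    }
  }

  public static void main(String[] args) throws IOException {
    checkOwner("", "ec2-user", "ec2-user");         // empty owner from s3n: passes
    checkOwner("ec2-user", "ec2-user", "ec2-user"); // matching owner: passes
  }
}

Relaxing the check like this would of course also silence legitimate ownership mismatches, so it is a trade-off.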


On Wed, Oct 24, 2012 at 12:21 AM, Harsh J <harsh@cloudera.com> wrote:
Hey Parth,

I don't think it's possible to run MR by basing the FS over S3
completely. You can use S3 for I/O for your files, but your
fs.default.name (or fs.defaultFS) must be either file:/// or hdfs://
filesystems. This way, your MR framework can run/distribute its files
well, and also still be able to process S3 URLs passed as input or
output locations.
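A minimal sketch of that setup, with placeholder bucket, paths and credentials, and an identity mapper used purely for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3InOutJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.default.name / fs.defaultFS stays on hdfs:// (as set in core-site.xml);
    // only the S3 credentials are supplied so s3n:// paths can be resolved.
    conf.set("fs.s3n.awsAccessKeyId", "KEY");        // placeholder
    conf.set("fs.s3n.awsSecretAccessKey", "SECRET"); // placeholder

    Job job = new Job(conf, "s3-in-out");
    job.setJarByClass(S3InOutJob.class);
    job.setMapperClass(Mapper.class); // identity mapper, map-only copy for illustration
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Only the input and output locations live on S3; staging stays on HDFS.
    FileInputFormat.addInputPath(job, new Path("s3n://bucket/input"));    // placeholder
    FileOutputFormat.setOutputPath(job, new Path("s3n://bucket/output")); // placeholder

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Only the input and output locations point at S3 here; the staging directory and the framework's own files stay on the cluster's default HDFS filesystem.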

On Tue, Oct 23, 2012 at 11:02 PM, Parth Savani <parth@sensenetworks.com> wrote:
> Hello Everyone,
> I am trying to run a Hadoop job with s3n as my filesystem.
> I changed the following properties in my hdfs-site.xml
>
> fs.default.name=s3n://KEY:VALUE@bucket/
> mapreduce.jobtracker.staging.root.dir=s3n://KEY:VALUE@bucket/tmp
>
> When I run the job from EC2, I get the following error
>
> The ownership on the staging directory
> s3n://KEY:VALUE@bucket/tmp/ec2-user/.staging is not as expected. It is owned
> by . The directory must be owned by the submitter ec2-user or by ec2-user
> at
> org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:113)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:844)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:844)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:481)
>
> I am using the Cloudera CDH4 Hadoop distribution. The error is thrown from
> the JobSubmissionFiles class:
> public static Path getStagingDir(JobClient client, Configuration conf)
>     throws IOException, InterruptedException {
>   Path stagingArea = client.getStagingAreaDir();
>   FileSystem fs = stagingArea.getFileSystem(conf);
>   String realUser;
>   String currentUser;
>   UserGroupInformation ugi = UserGroupInformation.getLoginUser();
>   realUser = ugi.getShortUserName();
>   currentUser = UserGroupInformation.getCurrentUser().getShortUserName();
>   if (fs.exists(stagingArea)) {
>     FileStatus fsStatus = fs.getFileStatus(stagingArea);
>     String owner = fsStatus.getOwner();
>     if (!(owner.equals(currentUser) || owner.equals(realUser))) {
>       throw new IOException("The ownership on the staging directory " +
>           stagingArea + " is not as expected. " +
>           "It is owned by " + owner + ". The directory must " +
>           "be owned by the submitter " + currentUser + " or " +
>           "by " + realUser);
>     }
>     if (!fsStatus.getPermission().equals(JOB_DIR_PERMISSION)) {
>       LOG.info("Permissions on staging directory " + stagingArea + " are " +
>           "incorrect: " + fsStatus.getPermission() + ". Fixing permissions " +
>           "to correct value " + JOB_DIR_PERMISSION);
>       fs.setPermission(stagingArea, JOB_DIR_PERMISSION);
>     }
>   } else {
>     fs.mkdirs(stagingArea, new FsPermission(JOB_DIR_PERMISSION));
>   }
>   return stagingArea;
> }
>
>
>
> I think my job calls getOwner(), which returns NULL since S3 does not have
> file permissions, and that results in the IOException I am getting.
>
> Any workaround for this? Any idea how I could use S3 as the filesystem with
> Hadoop in distributed mode?



--
Harsh J
