Subject: Re: S3 Hadoop FileSystems
From: Chris Nauroth <cnauroth@hortonworks.com>
Date: Tue, 3 May 2016 16:55:44 +0000
To: Elliot West <teabot@gmail.com>
Cc: "user@hadoop.apache.org" <user@hadoop.apache.org>
Hello Elliot,

You're welcome, and the time was not wasted at all. This is exactly the kind of valuable discussion that we like to share on the user@ list. As an outcome, we now have a more definitive answer about how MD5 verification works in s3a. Thank you for starting the discussion.

--Chris Nauroth

From: Elliot West <teabot@gmail.com>
Date: Tuesday, May 3, 2016 at 2:50 AM
To: Chris Nauroth <cnauroth@hortonworks.com>
Cc: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: S3 Hadoop FileSystems

Thank you,

I had a look at HADOOP-13076 and the associated code snippets in the AWS SDK. I agree that the MD5 check does appear to be taking place after all. I appreciate your efforts in looking into the matter and raising the ticket.

Apologies for any time I may have wasted.

Cheers - Elliot.

On 30 April 2016 at 23:16, Chris Nauroth <cnauroth@hortonworks.com> wrote:
I have some more information regarding MD5 verification with s3a. It turns out that s3a does have the MD5 verification. It's just not visible from reading the s3a code, because the MD5 verification is performed entirely within the AWS SDK library dependency. If you're interested in more details on how this works, or if you want to follow any further discussion on this topic, then please take a look at the comments on HADOOP-13076.
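For anyone curious what that check amounts to, here is a hand-rolled sketch of the same idea against the SDK: compare a locally computed MD5 with the ETag returned by a simple PUT, since for non-multipart uploads the ETag is the hex MD5 of the stored object. This is only an illustration, not the SDK's internal code, and the bucket, key, and file path are hypothetical placeholders.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.MessageDigest;

    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.PutObjectResult;

    public class Md5UploadCheck {
      public static void main(String[] args) throws Exception {
        File file = new File("/tmp/sample.dat");   // hypothetical local file
        String bucket = "example-bucket";          // hypothetical bucket
        String key = "sample.dat";

        // Compute the local MD5 digest of the file contents.
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = new FileInputStream(file)) {
          byte[] buf = new byte[8192];
          for (int n = in.read(buf); n != -1; n = in.read(buf)) {
            md5.update(buf, 0, n);
          }
        }
        StringBuilder localHex = new StringBuilder();
        for (byte b : md5.digest()) {
          localHex.append(String.format("%02x", b));
        }

        // Upload, then compare against the returned ETag; for a simple
        // (non-multipart) PUT the ETag is the hex MD5 of the stored object.
        AmazonS3Client s3 = new AmazonS3Client();  // default credential chain
        PutObjectResult result = s3.putObject(bucket, key, file);
        if (!localHex.toString().equalsIgnoreCase(result.getETag())) {
          throw new IllegalStateException("MD5 mismatch: upload may be corrupt");
        }
      }
    }

Note that multipart uploads return composite ETags that are not plain MD5 digests, which is why a check like this only holds for single-part PUTs.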

--Chris Nauroth

From: Chris Nauroth <cnauroth@hortonworks.com>
Date: Friday, April 29, 2016 at 9:03 PM
To: Elliot West <teabot@gmail.com>, "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: S3 Hadoop FileSystems

Hello Elliot,

The current state of support for the various S3 file system implementations within the Apache Hadoop community can be summed up as follows:

s3: Soon to be deprecated, not actively maintained, and appears not to work reliably at all in recent versions.
s3n: Not yet on its way to deprecation, but also not actively maintained.
s3a: This is seen as the direction forward for S3 integration, so this is where Hadoop contributors are currently focusing their energy (see the sketch below).
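To make the s3a recommendation concrete, here is a minimal sketch of copying a file from HDFS into a bucket through the Hadoop FileSystem API. The bucket name, paths, and credential values are hypothetical placeholders, and it assumes the hadoop-aws jar and its AWS SDK dependency are on the classpath.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class CopyToS3a {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hadoop-aws credential properties; values are placeholders.
        conf.set("fs.s3a.access.key", "ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "SECRET_KEY");

        FileSystem hdfs = FileSystem.get(URI.create("hdfs:///"), conf);
        FileSystem s3a = FileSystem.get(URI.create("s3a://example-bucket/"), conf);

        // Copy one file; deleteSource=false, overwrite=true.
        FileUtil.copy(hdfs, new Path("/data/events.avro"),
            s3a, new Path("s3a://example-bucket/data/events.avro"),
            false, true, conf);
      }
    }

The only thing that changes between the three implementations at this level is the URI scheme, which is part of why switching to s3a later is usually a configuration exercise rather than a code change.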

Regarding interoperability with EMR, I can't speak from any of my own experience on how to achieve this. We know that EMR runs custom code different from what you'll see in the Apache repos. I think that creates a risk for interop. My only suggestion would be to experiment and make sure to test any of your interop scenarios end-to-end very thoroughly.
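As a rough illustration of that kind of end-to-end test, a probe like the following could write a known payload through one cluster and be re-read for verification from the other. The bucket and path are hypothetical, and this is only a sketch of the approach, not an EMR-specific recipe.

    import java.net.URI;
    import java.security.MessageDigest;
    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RoundTripProbe {
      // Digest the object as read back through the given FileSystem.
      static byte[] digest(FileSystem fs, Path path) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (FSDataInputStream in = fs.open(path)) {
          byte[] buf = new byte[8192];
          for (int n = in.read(buf); n != -1; n = in.read(buf)) {
            md5.update(buf, 0, n);
          }
        }
        return md5.digest();
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("s3n://example-bucket/interop/probe.bin");  // hypothetical
        FileSystem fs = FileSystem.get(URI.create("s3n://example-bucket/"), conf);

        // Write a known payload...
        byte[] payload = new byte[1024 * 1024];
        try (FSDataOutputStream out = fs.create(path, true)) {
          out.write(payload);
        }

        // ...then read it back and compare digests. Running the read half
        // from the other cluster turns this into a cross-cluster check.
        byte[] expected = MessageDigest.getInstance("MD5").digest(payload);
        if (!Arrays.equals(expected, digest(fs, path))) {
          throw new IllegalStateException("Round-trip mismatch for " + path);
        }
      }
    }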

As you noticed, s3n no longer has a 5 GB limitation. Issue HADOOP-9454 introduced support for files larger than 5 GB by using multi-part upload. This patch was released in Apache Hadoop 2.4.0.
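If it helps, my recollection is that HADOOP-9454 put this behavior behind configuration switches along the following lines; treat the property names and threshold as assumptions to verify against core-default.xml in your release, since multi-part upload is off by default in s3n.

    import org.apache.hadoop.conf.Configuration;

    public class S3nMultipartConfig {
      public static Configuration withMultipart() {
        Configuration conf = new Configuration();
        // Property names as I recall them from HADOOP-9454; multi-part
        // upload must be switched on explicitly.
        conf.setBoolean("fs.s3n.multipart.uploads.enabled", true);
        // Upload in 64 MB parts once a file crosses this threshold.
        conf.setLong("fs.s3n.multipart.uploads.block.size", 64L * 1024 * 1024);
        return conf;
      }
    }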

Regarding the lack of MD5 verification in s3a, I believe that is just an oversight, not an intentional design choice. I filed HADOOP-13076 to track adding this feature in s3a.

--Chris Nauroth

From: Elliot West <teabot@gmail.com>
Date: Thursday, April 28, 2016 at 5:01 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: S3 Hadoop FileSystems

Hello,

I'm working on a project that moves data from HDFS file systems into S3 for analysis with Hive on EMR. Recently I've become quite confused with the state of play regarding the different FileSystems: s3, s3n, and s3a. For my use case I require the following:
  • Support for the transfer of very large files.
  • MD5 checks on copy operations to provide data verification.
  • Excellent compatibility within an EMR/Hive environment.
To move data between clusters it would seem that current versions of the NativeS3FileSystem are my best bet; it appears that only s3n provides MD5 checking. It is often cited that s3n does not support files over 5 GB, but I can find no indication of such a limitation in the source code; in fact, I see that it switches over to multi-part upload for larger files. So, has this limitation been removed in s3n?

Within EMR, Amazon appear to recommend s3, support s3n, and advise against s3a. So yet again s3n would appear to win out here too? I assume that the s3n implementation available in EMR is different to that in Apache Hadoop? I find it hard to imagine that AWS would use JetS3t instead of their own AWS Java client, but perhaps they do?

Finally, could I use NativeS3FileSystem to perform the actual transfer on my Apache Hadoop cluster but then rewrite the table locations in my EMR Hive metastore to use the s3:// protocol prefix? Could that work?
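For example, I imagine the rewrite would amount to something like this speculative sketch over the Hive JDBC driver, with a hypothetical host, table, and bucket, and ignoring the question of whether EMR's s3:// implementation would then read the data correctly:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RewriteTableLocation {
      public static void main(String[] args) throws Exception {
        // HiveServer2 on the EMR master; host, table, and bucket are hypothetical.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://emr-master:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
          // Point the existing table at the same data under the s3:// scheme.
          // Partitioned tables would need each partition location updated too.
          stmt.execute(
              "ALTER TABLE events SET LOCATION 's3://example-bucket/warehouse/events'");
        }
      }
    }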

I'd appreciate any light that can be shed on these questions, and any advice regarding my proposal to use s3n for this particular use case.

Thanks,

Elliot.


