Subject: Re: S3 Hadoop FileSystems
From: Chris Nauroth <cnauroth@hortonworks.com>
To: Elliot West <teabot@gmail.com>, "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Sat, 30 Apr 2016 22:16:31 +0000
I have some more information regarding MD5 verification with s3a. It turns out that s3a does have MD5 verification. It's just not visible from reading the s3a code, because the MD5 verification is performed entirely within the AWS SDK library dependency. If you're interested in more details on how this works, or if you want to follow any further discussion on this topic, then please take a look at the comments on HADOOP-13076.
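To make that concrete, here is a rough sketch of the kind of check the SDK performs internally when it reads an object back. This is illustrative only, not the SDK's actual code; the bucket and key are made up, and the ETag-equals-MD5 property only holds for objects uploaded in a single PUT (multipart uploads use a different ETag format, which the SDK accounts for separately).

import java.io.InputStream;
import java.security.MessageDigest;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3Object;

public class Md5CheckSketch {
  public static void main(String[] args) throws Exception {
    AmazonS3 s3 = new AmazonS3Client();  // default credential chain
    S3Object object = s3.getObject("my-bucket", "my-key");  // placeholders

    // Hash the body as it streams past, much as the SDK's own
    // client-side validation does.
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] buffer = new byte[8192];
    try (InputStream in = object.getObjectContent()) {
      for (int n; (n = in.read(buffer)) != -1; ) {
        md5.update(buffer, 0, n);
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : md5.digest()) {
      hex.append(String.format("%02x", b));
    }

    // For a single-PUT object, the ETag is the hex MD5 of the body.
    String etag = object.getObjectMetadata().getETag();
    if (!hex.toString().equalsIgnoreCase(etag)) {
      throw new IllegalStateException("MD5 mismatch: " + hex + " vs " + etag);
    }
  }
}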

--Chris Nauroth

From: Chris Nauroth <cnauroth@hortonworks.com>
Date: Friday, April 29, 2016 at 9:03 PM
To: Elliot West <teabot@gmail.com>, "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: S3 Hadoop FileSystems

Hello Elliot,

The current state of support for the various S3 file system implementations within the Apache Hadoop community can be summed up as follows:

s3: Soon to be deprecated, not actively maintained, appears to not work reliably at all in recent versions.
s3n: Not yet on its way to deprecation, but also not actively maintained.
s3a: This is seen as the direction forward for S3 integration, so this is where Hadoop contributors are currently focusing their energy. (There is a short usage sketch after this list.)
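For what it's worth, switching among them from client code is just a matter of the URI scheme; the scheme selects the FileSystem implementation. A minimal sketch follows, with placeholder bucket and credential values (the property names shown are the s3a ones from a recent 2.x release; s3n instead uses fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3SchemeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");  // placeholder
    conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");  // placeholder

    // The scheme in the URI (s3, s3n, or s3a) picks the implementation.
    FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath());
    }
  }
}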

Regarding interoperability with EMR, I can't speak from any of my own experience on how to achieve this. We know that EMR runs custom code different from what you'll see in the Apache repos. I think that creates a risk for interop. My only suggestion would be to experiment and make sure to test any of your interop scenarios end-to-end very thoroughly.

As you noticed, s3n no longer has a 5 GB limitation. Issue HADOOP-9454 introduced support for files larger than 5 GB by using multi-part upload. This patch was released in Apache Hadoop 2.4.0.
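If you want to control that behavior explicitly, it's driven by configuration. A sketch follows; the property names are the ones added by HADOOP-9454, but please double-check them against the core-default.xml shipped with your release:

import org.apache.hadoop.conf.Configuration;

public class S3nMultipartSketch {
  static Configuration withMultipart() {
    Configuration conf = new Configuration();
    // Enable multipart uploads for s3n.
    conf.setBoolean("fs.s3n.multipart.uploads.enabled", true);
    // Upload in 64 MB parts; anything larger than one part goes through
    // the S3 multipart API, which is what lifts the 5 GB single-PUT cap.
    conf.setLong("fs.s3n.multipart.uploads.block.size", 64L * 1024 * 1024);
    return conf;
  }
}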

Regarding the lack of MD5 verification in s3a, I believe that is just an oversight, not an intentional design choice. I filed HADOOP-13076 to track adding this feature in s3a.

--Chris Nauroth

From: Elliot West <teabot@gmail.com>
Date: Thursday, April 28, 2016 at 5:01 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: S3 Hadoop FileSystems

Hello,

I'm working on a project that moves data from HDFS file systems into S3 for analysis with Hive on EMR. Recently I've become quite confused with the state of play regarding the different FileSystems: s3, s3n, and s3a. For my use case I require the following:
  • Support for the transfer of very large files.
  • MD5 checks on copy operations to provide data verification.
  • Excellent compatibility within an EMR/Hive environment.
To move data between clusters it would seem that current versions of the NativeS3FileSystem are my best bet; it appears that only s3n provides MD5 checking. It is often cited that s3n does not support files over 5 GB, but I can find no indication of such a limitation in the source code; in fact, I see that it switches over to multi-part upload for larger files. So, has this limitation been removed in s3n?
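For concreteness, the transfer step I have in mind would look roughly like the sketch below, using the FileSystem API directly (bucket, paths, and credentials are placeholders; for bulk moves I'd expect to use DistCp to do the same thing in parallel):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsToS3nSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");      // placeholder
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");  // placeholder

    Path src = new Path("hdfs://namenode:8020/data/events");   // hypothetical
    Path dst = new Path("s3n://my-bucket/data/events");        // hypothetical

    FileSystem srcFs = src.getFileSystem(conf);
    FileSystem dstFs = dst.getFileSystem(conf);
    FileUtil.copy(srcFs, src, dstFs, dst, /* deleteSource */ false, conf);
  }
}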

Within EMR, Amazon appear to recommend s3, support s3n, and advise against s3a. So yet again s3n would appear to win out here too? I assume that the s3n implementation available in EMR is different to that in Apache Hadoop? I find it hard to imagine that AWS would use JetS3t instead of their own AWS Java client, but perhaps they do?

Finally, could I use NativeS3FileSystem to perform the actual transfer on my Apache Hadoop cluster but then rewrite the table locations in my EMR Hive metastore to use the s3:// protocol prefix? Could that work?
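Concretely, I imagine the rewrite looking something like the sketch below, issued through HiveServer2 (the JDBC URL, table name, and bucket are made up; a partitioned table would also need each partition's location updated with ALTER TABLE ... PARTITION ... SET LOCATION):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RewriteLocationSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn =
             DriverManager.getConnection("jdbc:hive2://emr-master:10000/default");
         Statement stmt = conn.createStatement()) {
      // Point the existing table at the same data under the s3:// scheme.
      stmt.execute("ALTER TABLE events SET LOCATION 's3://my-bucket/data/events'");
    }
  }
}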

I'd appreciate any light that can be shed on these questions, and any advice on my proposal to use s3n for this particular use case.

Thanks,

Elliot.

