hadoop-common-user mailing list archives

From Elliot West <tea...@gmail.com>
Subject Re: S3 Hadoop FileSystems
Date Tue, 03 May 2016 09:50:14 GMT
Thank you,

I had a look at HADOOP-13076 and the associated code snippets in the AWS
SDK. I agree that the MD5 check does appear to take place after all. I
appreciate your efforts in looking into the matter and raising the ticket.

Apologies for any time I may have wasted.

Cheers - Elliot.

On 30 April 2016 at 23:16, Chris Nauroth <cnauroth@hortonworks.com> wrote:

> I have some more information regarding MD5 verification with s3a.  It
> turns out that s3a does have the MD5 verification.  It's just not visible
> from reading the s3a code, because the MD5 verification is performed
> entirely within the AWS SDK library dependency.  If you're interested in
> more details on how this works, or if you want to follow any further
> discussion on this topic, then please take a look at the comments on
> HADOOP-13076.
> --Chris Nauroth
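
As a rough sketch of the kind of integrity check being discussed (not the
s3a code itself), a client can also force an explicit server-side check by
sending a Content-MD5 header with the AWS SDK for Java; this is in the same
spirit as the internal ETag/MD5 comparison described on HADOOP-13076. The
bucket, key, and file path below are hypothetical:

    import java.io.File;
    import java.io.FileInputStream;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.ObjectMetadata;
    import com.amazonaws.util.Base64;
    import com.amazonaws.util.Md5Utils;

    public class Md5UploadCheck {
      public static void main(String[] args) throws Exception {
        File file = new File("/tmp/part-00000");  // hypothetical local file
        AmazonS3 s3 = new AmazonS3Client();       // uses the default credential chain

        // Pre-compute the MD5 of the payload and send it as Content-MD5.
        // S3 recomputes the digest server-side and rejects the PUT with an
        // InvalidDigest error if the bytes were corrupted in transit.
        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(file.length());
        meta.setContentMD5(Base64.encodeAsString(Md5Utils.computeMD5Hash(file)));

        s3.putObject("my-bucket", "data/part-00000", new FileInputStream(file), meta);
      }
    }
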
>
> From: Chris Nauroth <cnauroth@hortonworks.com>
> Date: Friday, April 29, 2016 at 9:03 PM
> To: Elliot West <teabot@gmail.com>, "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Re: S3 Hadoop FileSystems
>
> Hello Elliot,
>
> The current state of support for the various S3 file system
> implementations within the Apache Hadoop community can be summed up as
> follows:
> s3: Soon to be deprecated, not actively maintained, and appears not to work
> reliably at all in recent versions.
> s3n: Not yet on its way to deprecation, but also not actively maintained.
> s3a: This is seen as the direction forward for S3 integration, so this is
> where Hadoop contributors are currently focusing their energy.
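
To make the s3a option concrete, here is a minimal sketch of pointing the
Hadoop FileSystem API at an s3a bucket. The bucket name and credentials are
placeholders; the fs.s3a.* property names come from the hadoop-aws module:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3aListing {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standard s3a credential properties; values are placeholders.
        conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

        // Bind a FileSystem instance to the s3a scheme and list the bucket root.
        FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
          System.out.println(status.getPath());
        }
      }
    }
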
> Regarding interoperability with EMR, I can't speak from my own experience
> on how to achieve this.  We know that EMR runs custom code
> different from what you'll see in the Apache repos.  I think that creates a
> risk for interop.  My only suggestion would be to experiment and make sure
> to test any of your interop scenarios end-to-end very thoroughly.
> As you noticed, s3n no longer has a 5 GB limitation.  Issue HADOOP-9454
> introduced support for files larger than 5 GB by using multi-part upload.
> This patch was released in Apache Hadoop 2.4.0.
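
For reference, the multi-part path in s3n is gated by configuration; the
following is a sketch assuming the property names as they appear in the
HADOOP-9454 patch (worth verifying against your release):

    import org.apache.hadoop.conf.Configuration;

    public class S3nMultipartSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Switch s3n from a single PUT to multi-part upload for large files
        // (added by HADOOP-9454; disabled by default).
        conf.setBoolean("fs.s3n.multipart.uploads.enabled", true);
        // Upload part size in bytes; files above this size are split (64 MB here).
        conf.setLong("fs.s3n.multipart.uploads.block.size", 64L * 1024 * 1024);
      }
    }
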
> Regarding lack of MD5 verification in s3a, I believe that is just an
> oversight, not an intentional design choice.  I filed HADOOP-13076 to track
> adding this feature in s3a.
> --Chris Nauroth
>
> From: Elliot West <teabot@gmail.com>
> Date: Thursday, April 28, 2016 at 5:01 AM
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: S3 Hadoop FileSystems
>
> Hello,
>
> I'm working on a project that moves data from HDFS file systems into S3
> for analysis with Hive on EMR. Recently I've become quite confused with the
> state of play regarding the different FileSystems: s3, s3n, and s3a. For my
> use case I require the following:
>    - Support for the transfer of very large files.
>    - MD5 checks on copy operations to provide data verification.
>    - Excellent compatibility within an EMR/Hive environment.
> To move data between clusters it would seem that current versions of the
> NativeS3FileSystem are my best bet; it appears that only s3n provides MD5
> checking
> <https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L120>.
> It is often cited that s3n does not support files over 5GB but I can find
> no indication of such a limitation in the source code, in fact I see that
> it switches over to multi-part upload for larger files
> <https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L130>.
> So, has this limitation been removed in s3n?
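
Concretely, the transfer described here might be driven through the DistCp
Java API. The paths, cluster names, and credentials below are hypothetical,
and the DistCpOptions constructor shown is the Hadoop 2.x form:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class HdfsToS3nCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // s3n credential properties; values are placeholders.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        // Copy an HDFS directory to S3 via the s3n connector.
        DistCpOptions options = new DistCpOptions(
            Arrays.asList(new Path("hdfs://namenode:8020/data/events")),
            new Path("s3n://my-bucket/data/events"));
        new DistCp(conf, options).execute();  // runs the copy as a MapReduce job
      }
    }
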
> Within EMR Amazon appear to recommend s3, support s3n, and advise against
> s3a
> <http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-file-systems.html>.
> So s3n would appear to win out here too? I assume that the s3n
> implementation available in EMR is different to that in Apache Hadoop? I
> find it hard to imagine that AWS would use JetS3t instead of their own AWS
> Java client, but perhaps they do?
> Finally, could I use NativeS3FileSystem to perform the actual transfer on
> my Apache Hadoop cluster but then rewrite the table locations in my EMR
> Hive metastore to use the s3:// protocol prefix? Could that work?
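
For what it's worth, that rewrite can be expressed as plain Hive DDL. A
sketch over Hive JDBC follows, where the connection string, database, table,
and bucket are all hypothetical (a partitioned table would need each
partition location updated as well); whether EMR's s3:// client reads
s3n-written objects cleanly is exactly the interop question above:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RewriteTableLocation {
      public static void main(String[] args) throws Exception {
        // Requires the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver)
        // on the classpath; all identifiers below are placeholders.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://emr-master:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
          // Repoint the table at the same objects under the s3:// scheme
          // that EMR's Hive expects; no data is moved by this statement.
          stmt.execute(
              "ALTER TABLE web_logs SET LOCATION 's3://my-bucket/warehouse/web_logs'");
        }
      }
    }
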
> I'd appreciate any light that can be shed on these questions, and any
> advice on my reasoning behind the proposal to use s3n for this particular
> use case.
> Thanks,
> Elliot.
