Subject: Re: S3 Hadoop FileSystems
From: Chris Nauroth <cnauroth@hortonworks.com>
Date: Tue, 3 May 2016 16:55:44 +0000
To: Elliot West <teabot@gmail.com>
Cc: "user@hadoop.apache.org" <user@hadoop.apache.org>
Hello Elliot,

You're welcome, and the time was not wasted at all. This is exactly the kind of valuable discussion that we like to share on the user@ list. As an outcome, we now have a more definitive answer about how MD5 verification works in s3a. Thank you for starting the discussion.

--Chris Nauroth

From: Elliot West <teabot@gmail.com>
Date: Tuesday, May 3, 2016 at 2:50 AM
To: Chris Nauroth <cnauroth@hortonworks.com>
Cc: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: S3 Hadoop FileSystems

Thank you,

I had a look at HADOOP-13076 and the associated code snippets in the AWS SDK. I agree that the MD5 check does appear to be taking place after all. I appreciate your efforts in looking into the matter and raising the ticket.

Apologies for any time I may have wasted.

Cheers - Elliot.

On 30 April 2016 at 23:16, Chris Nauroth <cnauroth@hortonworks.com> wrote:
I have some more information regarding MD5 verification with s3a. It turns out that s3a does have the MD5 verification. It's just not visible from reading the s3a code, because the MD5 verification is performed entirely within the AWS SDK library dependency. If you're interested in more details on how this works, or if you want to follow any further discussion on this topic, then please take a look at the comments on HADOOP-13076.
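For anyone curious what that check amounts to, here is a hand-rolled sketch of the same idea against the SDK: compare a locally computed MD5 with the ETag returned by a simple PUT, since for non-multipart uploads the ETag is the hex MD5 of the stored object. This is only an illustration, not the SDK's internal code, and the bucket, key, and file path are hypothetical placeholders.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.MessageDigest;

    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.PutObjectResult;

    public class Md5UploadCheck {
      public static void main(String[] args) throws Exception {
        File file = new File("/tmp/sample.dat");   // hypothetical local file
        String bucket = "example-bucket";          // hypothetical bucket
        String key = "sample.dat";

        // Compute the local MD5 digest of the file contents.
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = new FileInputStream(file)) {
          byte[] buf = new byte[8192];
          for (int n = in.read(buf); n != -1; n = in.read(buf)) {
            md5.update(buf, 0, n);
          }
        }
        StringBuilder localHex = new StringBuilder();
        for (byte b : md5.digest()) {
          localHex.append(String.format("%02x", b));
        }

        // Upload, then compare against the returned ETag; for a simple
        // (non-multipart) PUT the ETag is the hex MD5 of the stored object.
        AmazonS3Client s3 = new AmazonS3Client();  // default credential chain
        PutObjectResult result = s3.putObject(bucket, key, file);
        if (!localHex.toString().equalsIgnoreCase(result.getETag())) {
          throw new IllegalStateException("MD5 mismatch: upload may be corrupt");
        }
      }
    }

Note that multipart uploads return composite ETags that are not plain MD5 digests, which is why a check like this only holds for single-part PUTs.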

--Chris Nauroth

From: Chris Nauroth <cnauroth@hortonworks.com>
Date: Friday, April 29, 2016 at 9:03 PM
To: Elliot West <teabot@gmail.com>, "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: S3 Hadoop FileSystems

Hello Elliot,

The current state of support for the various S3 file system implementations within the Apache Hadoop community can be summed up as follows:

s3: Soon to be deprecated, not actively maintained, and appears not to work reliably at all in recent versions.
s3n: Not yet on its way to deprecation, but also not actively maintained.
s3a: This is seen as the direction forward for S3 integration, so this is where Hadoop contributors are currently focusing their energy (see the sketch below).
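To make the s3a recommendation concrete, here is a minimal sketch of copying a file from HDFS into a bucket through the Hadoop FileSystem API. The bucket name, paths, and credential values are hypothetical placeholders, and it assumes the hadoop-aws jar and its AWS SDK dependency are on the classpath.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class CopyToS3a {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hadoop-aws credential properties; values are placeholders.
        conf.set("fs.s3a.access.key", "ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "SECRET_KEY");

        FileSystem hdfs = FileSystem.get(URI.create("hdfs:///"), conf);
        FileSystem s3a = FileSystem.get(URI.create("s3a://example-bucket/"), conf);

        // Copy one file; deleteSource=false, overwrite=true.
        FileUtil.copy(hdfs, new Path("/data/events.avro"),
            s3a, new Path("s3a://example-bucket/data/events.avro"),
            false, true, conf);
      }
    }

The only thing that changes between the three implementations at this level is the URI scheme, which is part of why switching to s3a later is usually a configuration exercise rather than a code change.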

Regarding interoperability with EMR, I can't speak from any of my own experience on how to achieve this. We know that EMR runs custom code different from what you'll see in the Apache repos. I think that creates a risk for interop. My only suggestion would be to experiment and make sure to test any of your interop scenarios end-to-end very thoroughly.
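As a rough illustration of that kind of end-to-end test, a probe like the following could write a known payload through one cluster and be re-read for verification from the other. The bucket and path are hypothetical, and this is only a sketch of the approach, not an EMR-specific recipe.

    import java.net.URI;
    import java.security.MessageDigest;
    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RoundTripProbe {
      // Digest the object as read back through the given FileSystem.
      static byte[] digest(FileSystem fs, Path path) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (FSDataInputStream in = fs.open(path)) {
          byte[] buf = new byte[8192];
          for (int n = in.read(buf); n != -1; n = in.read(buf)) {
            md5.update(buf, 0, n);
          }
        }
        return md5.digest();
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("s3n://example-bucket/interop/probe.bin");  // hypothetical
        FileSystem fs = FileSystem.get(URI.create("s3n://example-bucket/"), conf);

        // Write a known payload...
        byte[] payload = new byte[1024 * 1024];
        try (FSDataOutputStream out = fs.create(path, true)) {
          out.write(payload);
        }

        // ...then read it back and compare digests. Running the read half
        // from the other cluster turns this into a cross-cluster check.
        byte[] expected = MessageDigest.getInstance("MD5").digest(payload);
        if (!Arrays.equals(expected, digest(fs, path))) {
          throw new IllegalStateException("Round-trip mismatch for " + path);
        }
      }
    }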

As you noticed, s3n no longer has a 5 GB limitation. Issue HADOOP-9454 introduced support for files larger than 5 GB by using multi-part upload. This patch was released in Apache Hadoop 2.4.0.
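If it helps, my recollection is that HADOOP-9454 put this behavior behind configuration switches along the following lines; treat the property names and threshold as assumptions to verify against core-default.xml in your release, since multi-part upload is off by default in s3n.

    import org.apache.hadoop.conf.Configuration;

    public class S3nMultipartConfig {
      public static Configuration withMultipart() {
        Configuration conf = new Configuration();
        // Property names as I recall them from HADOOP-9454; multi-part
        // upload must be switched on explicitly.
        conf.setBoolean("fs.s3n.multipart.uploads.enabled", true);
        // Upload in 64 MB parts once a file crosses this threshold.
        conf.setLong("fs.s3n.multipart.uploads.block.size", 64L * 1024 * 1024);
        return conf;
      }
    }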

Regarding the lack of MD5 verification in s3a, I believe that is just an oversight, not an intentional design choice. I filed HADOOP-13076 to track adding this feature in s3a.

--Chris Nauroth

From: Elliot West <teabot@gmail.com>
Date: Thursday, April 28, 2016 at 5:01 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: S3 Hadoop FileSystems

Hello,

I'm working on a project that moves data from HDFS file systems into S3 for analysis with Hive on EMR. Recently I've become quite confused with the state of play regarding the different FileSystems: s3, s3n, and s3a. For my use case I require the following:
  • Support for the transfer of very large files.
  • MD5 checks on copy operations to provide data verification.
  • Excellent compatibility within an EMR/Hive environment.
To move data between clusters it would seem that current versions of the NativeS3FileSystem are my best bet; it appears that only s3n provides MD5 checking. It is often cited that s3n does not support files over 5 GB, but I can find no indication of such a limitation in the source code; in fact, I see that it switches over to multi-part upload for larger files. So, has this limitation been removed in s3n?

Within EMR, Amazon appear to recommend s3, support s3n, and advise against s3a. So yet again s3n would appear to win out here too? I assume that the s3n implementation available in EMR is different to that in Apache Hadoop? I find it hard to imagine that AWS would use JetS3t instead of their own AWS Java client, but perhaps they do?

Finally, could I use NativeS3FileSystem to perform the actual transfer on my Apache Hadoop cluster but then rewrite the table locations in my EMR Hive metastore to use the s3:// protocol prefix? Could that work?
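For example, I imagine the rewrite would amount to something like this speculative sketch over the Hive JDBC driver, with a hypothetical host, table, and bucket, and ignoring the question of whether EMR's s3:// implementation would then read the data correctly:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RewriteTableLocation {
      public static void main(String[] args) throws Exception {
        // HiveServer2 on the EMR master; host, table, and bucket are hypothetical.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://emr-master:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
          // Point the existing table at the same data under the s3:// scheme.
          // Partitioned tables would need each partition location updated too.
          stmt.execute(
              "ALTER TABLE events SET LOCATION 's3://example-bucket/warehouse/events'");
        }
      }
    }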

I'd appreciate any light that can be shed on these questions, and any advice regarding my proposal to use s3n for this particular use case.

Thanks,

Elliot.


