Subject: Re: Using S3 instead of HDFS
From: Mark Kerzner
To: common-user@hadoop.apache.org
Date: Wed, 18 Jan 2012 10:56:37 -0600

Awesomely important, Matt, thank you so much!

Mark

On Wed, Jan 18, 2012 at 10:53 AM, Matt Pouttu-Clarke
<Matt.Pouttu-Clarke@icrossing.com> wrote:

> I would strongly suggest using this method to read from S3 only.
>
> I have had problems writing large volumes of data to S3 from Hadoop
> using the native s3fs. Supposedly a fix is on the way from Amazon (it
> is an undocumented internal error being thrown). However, the fix is
> already two months later than we expected, and we currently have no
> ETA.
>
> If you want to write data to S3 reliably, use the S3 API directly and
> stream the data from HDFS into S3. Just remember that S3 requires the
> final size of the data before you start writing, so it is not true
> streaming in that sense. After your job has finished writing its part
> files to HDFS, you can run a map-only job that streams the data up
> into S3 using the S3 API directly.
>
> In no way, shape, or form should S3 currently be considered a
> replacement for HDFS when it comes to writes. Your jobs will need to
> be modified and customized to write to S3 reliably, there are file
> size limits on writes, and the multi-part upload option does not work
> correctly and randomly throws an internal Amazon error.
>
> You have been warned!
>
> -Matt
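For illustration, a rough sketch of the kind of direct upload Matt
describes: stream one finished HDFS part file into S3 through the S3
API, supplying the content length up front. The thread does not name a
client library, so this assumes the AWS SDK for Java; the bucket, key,
path, and credential strings are placeholders.

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    public class HdfsToS3 {
        public static void main(String[] args) throws Exception {
            // Open the finished part file in HDFS (placeholder path);
            // assumes HDFS is still the default filesystem here.
            Configuration conf = new Configuration();
            Path part = new Path("/output/part-00000");
            FileSystem hdfs = FileSystem.get(conf);
            long length = hdfs.getFileStatus(part).getLen();
            InputStream in = hdfs.open(part);

            // S3 wants the final size before the upload starts, so pass
            // the HDFS file length in the object metadata.
            ObjectMetadata meta = new ObjectMetadata();
            meta.setContentLength(length);

            // Placeholder credentials, bucket, and key.
            AmazonS3Client s3 = new AmazonS3Client(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
            s3.putObject("my-bucket", "output/part-00000", in, meta);
            in.close();
        }
    }

In the map-only job Matt mentions, each map task would presumably run
an upload like this over its own assigned part file.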
> On 1/18/12 9:37 AM, "Mark Kerzner" wrote:
>
>> It worked, thank you, Harsh.
>>
>> Mark
>>
>> On Wed, Jan 18, 2012 at 1:16 AM, Harsh J wrote:
>>
>>> Ah, sorry about missing that. The settings go in core-site.xml
>>> (hdfs-site.xml is no longer relevant once you switch to S3).
>>>
>>> On 18-Jan-2012, at 12:36 PM, Mark Kerzner wrote:
>>>
>>>> That wiki page mentions hadoop-site.xml, but that is old; now there
>>>> are core-site.xml and hdfs-site.xml, so which one do you put it in?
>>>>
>>>> Thank you (and good night, Central Time :)
>>>>
>>>> mark
>>>>
>>>> On Wed, Jan 18, 2012 at 12:52 AM, Harsh J wrote:
>>>>
>>>>> When using S3 you do not need to run any component of HDFS at all.
>>>>> It is meant to be an alternate FS choice. You need to run only MR.
>>>>>
>>>>> The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions
>>>>> how to go about specifying your auth details to S3, either directly
>>>>> via the fs.default.name URI or via the additional properties
>>>>> fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not
>>>>> work for you?
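As an illustration of what Harsh describes, a minimal sketch that sets
the credential properties on a Hadoop Configuration and opens the
bucket as an ordinary FileSystem, with no NameNode involved. The bucket
name and keys are placeholders; for s3n:// URIs the analogous property
names are fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListS3Bucket {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Credentials for the s3n native filesystem (placeholders).
            conf.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY");
            conf.set("fs.s3n.awsSecretAccessKey", "SECRET_KEY");

            // The s3n:// URI selects the S3 native FileSystem
            // implementation directly; no HDFS daemons are needed.
            FileSystem fs = FileSystem.get(
                    URI.create("s3n://my-bucket/"), conf);
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }

The same two property names are what would go into core-site.xml, per
Harsh's note above.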
>>>>> On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner
>>>>> <mark.kerzner@shmsoft.com> wrote:
>>>>>
>>>>>> Well, here is my error message:
>>>>>>
>>>>>> Starting Hadoop namenode daemon: starting namenode, logging to
>>>>>> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out
>>>>>> ERROR. Could not start Hadoop namenode daemon
>>>>>> Starting Hadoop secondarynamenode daemon: starting secondarynamenode,
>>>>>> logging to
>>>>>> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out
>>>>>> Exception in thread "main" java.lang.IllegalArgumentException:
>>>>>> Invalid URI for NameNode address (check fs.default.name):
>>>>>> s3n://myname.testdata is not of scheme 'hdfs'.
>>>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224)
>>>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209)
>>>>>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182)
>>>>>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:150)
>>>>>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624)
>>>>>> ERROR. Could not start Hadoop secondarynamenode daemon
>>>>>>
>>>>>> And, if I don't need to start the NameNode, then where do I give
>>>>>> the S3 credentials?
>>>>>>
>>>>>> Thank you,
>>>>>> Mark
>>>>>>
>>>>>> On Wed, Jan 18, 2012 at 12:36 AM, Harsh J wrote:
>>>>>>
>>>>>>> Hey Mark,
>>>>>>>
>>>>>>> What is the exact trouble you run into? What do the error
>>>>>>> messages indicate?
>>>>>>>
>>>>>>> This should be definitive enough, I think:
>>>>>>> http://wiki.apache.org/hadoop/AmazonS3
>>>>>>>
>>>>>>> On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner
>>>>>>> <mark.kerzner@shmsoft.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Whatever I do, I can't make it work; that is, I cannot use
>>>>>>>>
>>>>>>>> s3://host
>>>>>>>>
>>>>>>>> or s3n://host
>>>>>>>>
>>>>>>>> as a replacement for HDFS while running an EC2 cluster. I change
>>>>>>>> the settings in core-site.xml and hdfs-site.xml, start the
>>>>>>>> Hadoop services, and it fails with error messages.
>>>>>>>>
>>>>>>>> Is there a place where this is clearly described?
>>>>>>>>
>>>>>>>> Thank you so much.
>>>>>>>>
>>>>>>>> Mark
>>>>>>>
>>>>>>> --
>>>>>>> Harsh J
>>>>>>> Customer Ops. Engineer, Cloudera