Subject: Re: Using S3 instead of HDFS
From: Mark Kerzner
To: common-user@hadoop.apache.org
Date: Wed, 18 Jan 2012 10:56:37 -0600

Awesomely important, Matt, thank you so much!

Mark

On Wed, Jan 18, 2012 at 10:53 AM, Matt Pouttu-Clarke
<Matt.Pouttu-Clarke@icrossing.com> wrote:

> I would strongly suggest using this method to read from S3 only.
>
> I have had problems writing large volumes of data to S3 from Hadoop
> using the native s3fs. Supposedly a fix is on the way from Amazon (it
> is an undocumented internal error being thrown). However, the fix is
> already two months later than we expected, and we currently have no
> ETA.
>
> If you want to write data to S3 reliably, use the S3 API directly and
> stream the data from HDFS into S3. Just remember that S3 requires the
> final size of the data before you start writing, so it is not true
> streaming in that sense. After your job has finished writing its part
> files to HDFS, you can run a map-only job that streams the data up
> into S3 using the S3 API directly.
>
> In no way, shape, or form should S3 currently be considered a
> replacement for HDFS when it comes to writes. Your jobs will need to
> be modified and customized to write to S3 reliably, there are file
> size limits on writes, and the multi-part upload option does not work
> correctly and randomly throws an internal Amazon error.
>
> You have been warned!
>
> -Matt
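For illustration, a rough sketch of the kind of direct upload Matt
describes: stream one finished HDFS part file into S3 through the S3
API, supplying the content length up front. The thread does not name a
client library, so this assumes the AWS SDK for Java; the bucket, key,
path, and credential strings are placeholders.

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    public class HdfsToS3 {
        public static void main(String[] args) throws Exception {
            // Open the finished part file in HDFS (placeholder path);
            // assumes HDFS is still the default filesystem here.
            Configuration conf = new Configuration();
            Path part = new Path("/output/part-00000");
            FileSystem hdfs = FileSystem.get(conf);
            long length = hdfs.getFileStatus(part).getLen();
            InputStream in = hdfs.open(part);

            // S3 wants the final size before the upload starts, so pass
            // the HDFS file length in the object metadata.
            ObjectMetadata meta = new ObjectMetadata();
            meta.setContentLength(length);

            // Placeholder credentials, bucket, and key.
            AmazonS3Client s3 = new AmazonS3Client(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
            s3.putObject("my-bucket", "output/part-00000", in, meta);
            in.close();
        }
    }

In the map-only job Matt mentions, each map task would presumably run
an upload like this over its own assigned part file.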
> On 1/18/12 9:37 AM, "Mark Kerzner" wrote:
>
>> It worked, thank you, Harsh.
>>
>> Mark
>>
>> On Wed, Jan 18, 2012 at 1:16 AM, Harsh J wrote:
>>
>>> Ah, sorry about missing that. The settings go in core-site.xml
>>> (hdfs-site.xml is no longer relevant once you switch to S3).
>>>
>>> On 18-Jan-2012, at 12:36 PM, Mark Kerzner wrote:
>>>
>>>> That wiki page mentions hadoop-site.xml, but that is old; now there
>>>> are core-site.xml and hdfs-site.xml, so which one do you put it in?
>>>>
>>>> Thank you (and good night, Central Time :)
>>>>
>>>> mark
>>>>
>>>> On Wed, Jan 18, 2012 at 12:52 AM, Harsh J wrote:
>>>>
>>>>> When using S3 you do not need to run any component of HDFS at all.
>>>>> It is meant to be an alternate FS choice. You need to run only MR.
>>>>>
>>>>> The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions
>>>>> how to go about specifying your auth details to S3, either directly
>>>>> via the fs.default.name URI or via the additional properties
>>>>> fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not
>>>>> work for you?
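As an illustration of what Harsh describes, a minimal sketch that sets
the credential properties on a Hadoop Configuration and opens the
bucket as an ordinary FileSystem, with no NameNode involved. The bucket
name and keys are placeholders; for s3n:// URIs the analogous property
names are fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListS3Bucket {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Credentials for the s3n native filesystem (placeholders).
            conf.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY");
            conf.set("fs.s3n.awsSecretAccessKey", "SECRET_KEY");

            // The s3n:// URI selects the S3 native FileSystem
            // implementation directly; no HDFS daemons are needed.
            FileSystem fs = FileSystem.get(
                    URI.create("s3n://my-bucket/"), conf);
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }

The same two property names are what would go into core-site.xml, per
Harsh's note above.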
>>>>> On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner
>>>>> <mark.kerzner@shmsoft.com> wrote:
>>>>>
>>>>>> Well, here is my error message:
>>>>>>
>>>>>> Starting Hadoop namenode daemon: starting namenode, logging to
>>>>>> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out
>>>>>> ERROR. Could not start Hadoop namenode daemon
>>>>>> Starting Hadoop secondarynamenode daemon: starting secondarynamenode,
>>>>>> logging to
>>>>>> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out
>>>>>> Exception in thread "main" java.lang.IllegalArgumentException:
>>>>>> Invalid URI for NameNode address (check fs.default.name):
>>>>>> s3n://myname.testdata is not of scheme 'hdfs'.
>>>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224)
>>>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209)
>>>>>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182)
>>>>>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:150)
>>>>>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624)
>>>>>> ERROR. Could not start Hadoop secondarynamenode daemon
>>>>>>
>>>>>> And, if I don't need to start the NameNode, then where do I give
>>>>>> the S3 credentials?
>>>>>>
>>>>>> Thank you,
>>>>>> Mark
>>>>>>
>>>>>> On Wed, Jan 18, 2012 at 12:36 AM, Harsh J wrote:
>>>>>>
>>>>>>> Hey Mark,
>>>>>>>
>>>>>>> What is the exact trouble you run into? What do the error
>>>>>>> messages indicate?
>>>>>>>
>>>>>>> This should be definitive enough, I think:
>>>>>>> http://wiki.apache.org/hadoop/AmazonS3
>>>>>>>
>>>>>>> On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner
>>>>>>> <mark.kerzner@shmsoft.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Whatever I do, I can't make it work; that is, I cannot use
>>>>>>>>
>>>>>>>> s3://host
>>>>>>>>
>>>>>>>> or s3n://host
>>>>>>>>
>>>>>>>> as a replacement for HDFS while running an EC2 cluster. I change
>>>>>>>> the settings in core-site.xml and hdfs-site.xml, start the
>>>>>>>> Hadoop services, and it fails with error messages.
>>>>>>>>
>>>>>>>> Is there a place where this is clearly described?
>>>>>>>>
>>>>>>>> Thank you so much.
>>>>>>>>
>>>>>>>> Mark
>>>>>>>
>>>>>>> --
>>>>>>> Harsh J
>>>>>>> Customer Ops. Engineer, Cloudera