hadoop-common-user mailing list archives

From "tim robertson" <timrobertson...@gmail.com>
Subject Re: Hadoop on EC2 + S3 - best practice?
Date Tue, 01 Jul 2008 20:16:31 GMT
Hi Tom,
Thanks for the reply, and after posting I found your blogs and followed your
instructions - thanks

There were a couple of gotchas:
1) My <secret> had a / in it, and the escaping does not work (a workaround is
sketched below).
2) I copied to the root directory of the S3 bucket and could not manage to
get it back out again with distcp, so I had to blow it away and do another
copy up.
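
For anyone who hits gotcha 1: one way around it is to put the AWS keys into
the Hadoop configuration rather than into the s3 URL, so the secret never has
to be escaped. A minimal sketch only; the property names are the standard
fs.s3.* ones, but the key values and bucket are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class S3ConfigCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; substitute your own. A "/" in the secret
        // is harmless here because it is never URL-encoded.
        conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
        conf.set("fs.s3.awsSecretAccessKey", "YOUR/SECRET/KEY");
        // The URL itself then carries no credentials at all.
        FileSystem fs = FileSystem.get(new URI("s3://my-bucket/"), conf);
        System.out.println("Connected to " + fs.getUri());
      }
    }

(The same two properties can equally go into hadoop-site.xml on the cluster.)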

It was nice to get it running in the end, and I blogged my experience:
http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html
(I thank you at the bottom ;o)

Thanks,

Tim




On Tue, Jul 1, 2008 at 6:35 PM, Tom White <tom.e.white@gmail.com> wrote:

> Hi Tim,
>
> The steps you outline look about right. Because your file is >5GB, you
> will need to use the S3 block file system, which has an s3 URL. (See
> http://wiki.apache.org/hadoop/AmazonS3) You shouldn't have to build
> your own AMI unless you have dependencies that can't be submitted as a
> part of the MapReduce job.
>
> To read and write to S3 you can just use s3 URLs. Otherwise you can
> use distcp to copy between S3 and HDFS before and after running your
> job. This article I wrote has some more tips:
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873
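
(For anyone searching the archives: with the keys in the configuration as
above, the distcp route Tom mentions looks roughly like this; the bucket and
paths are made-up examples, and the job itself then reads and writes plain
HDFS paths:

    bin/hadoop distcp s3://my-bucket/input /user/root/input
    <run the MapReduce job with input /user/root/input, output /user/root/output>
    bin/hadoop distcp /user/root/output s3://my-bucket/output
)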
>
> Hope that helps,
>
> Tom
>
> On Sat, Jun 28, 2008 at 10:24 AM, tim robertson
> <timrobertson100@gmail.com> wrote:
> > Hi all,
> > I have data in a file (150 million lines, about 100 GB) and have several
> > MapReduce classes for my processing (custom index generation).
> >
> > Can someone please confirm the following is the best way to run on EC2
> > and S3 (both of which I am new to..)
> >
> > 1) load my 100Gb file into S3
> > 2) create a class that will load the file from S3 and use as input to
> > mapreduce (S3 not used during processing) and save output back to S3
> > 3) create an AMI with Hadoop + dependencies and my Jar file (loading the
> > S3 input and the MR code) - I will base this on the public Hadoop AMI,
> > I guess
> > 4) run using the standard scripts
> >
> > Is this best practice?
> > I assume this is pretty common... is there a better way where I can
> > submit my Jar at runtime and just pass in the URL for the input and
> > output files in S3?
> >
> > If not, does anyone have an example that takes input from S3 and writes
> > output to S3 as well?
> >
> > Thanks for any advice or suggestions on the best way to run this.
> >
> > Tim
> >
>
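
(And since the original question asked for an example that takes its input
from S3 and writes its output back to S3, here is a bare-bones sketch against
the mapred API. The bucket, paths and credentials are placeholders, and the
identity mapper/reducer just stand in for the real index-generation classes.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class S3ToS3Job {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(S3ToS3Job.class);
        conf.setJobName("s3-in-s3-out");

        // Placeholder credentials; better kept in hadoop-site.xml so they
        // never appear in the s3 URL.
        conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
        conf.set("fs.s3.awsSecretAccessKey", "YOUR/SECRET/KEY");

        // Read directly from, and write directly back to, the S3 block
        // filesystem. Bucket and paths are made-up examples.
        FileInputFormat.addInputPath(conf, new Path("s3://my-bucket/input"));
        FileOutputFormat.setOutputPath(conf, new Path("s3://my-bucket/output"));

        // Default TextInputFormat yields LongWritable offsets and Text lines;
        // the identity classes pass them straight through.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        JobClient.runJob(conf);
      }
    }

Submitted with "bin/hadoop jar myjob.jar S3ToS3Job" from the master node of
the public Hadoop AMI, there is no need to bake the Jar into a custom image.)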
