hadoop-common-user mailing list archives

From "Tom White" <tom.e.wh...@gmail.com>
Subject Re: Hadoop on EC2 + S3 - best practice?
Date Tue, 01 Jul 2008 16:35:53 GMT
Hi Tim,

The steps you outline look about right. Because your file is >5GB you
will need to use the S3 block file system, which uses the s3 URL
scheme. (See http://wiki.apache.org/hadoop/AmazonS3) You shouldn't have to build
your own AMI unless you have dependencies that can't be submitted as a
part of the MapReduce job.
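As a concrete sketch of the S3 block file system mentioned above (bucket name, file names, and credential handling are my placeholders, not from this thread), you store your AWS keys in the Hadoop configuration and then address the bucket with s3:// URLs:

```shell
# Hypothetical example. AWS credentials are assumed to be set in
# hadoop-site.xml via the fs.s3.awsAccessKeyId and
# fs.s3.awsSecretAccessKey properties; "mybucket" is a placeholder.

# Create a directory in the S3 block file system and upload the input:
hadoop fs -mkdir s3://mybucket/input
hadoop fs -put bigfile.txt s3://mybucket/input/

# Verify the file is visible through the s3 scheme:
hadoop fs -ls s3://mybucket/input/
```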

To read from and write to S3 you can use s3 URLs directly as your
job's input and output paths. Otherwise you can use distcp to copy
between S3 and HDFS before and after running your job. This article I
wrote has some more tips:
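The two approaches might look like this (bucket name, jar name, class name, and cluster address are all illustrative assumptions, not taken from the thread):

```shell
# Hypothetical paths throughout. Option 1: stage input from S3 into
# HDFS with distcp, run the job against HDFS, then copy results back.
hadoop distcp s3://mybucket/input hdfs://namenode:9000/user/tim/input
hadoop jar myjob.jar MyIndexJob \
    hdfs://namenode:9000/user/tim/input \
    hdfs://namenode:9000/user/tim/output
hadoop distcp hdfs://namenode:9000/user/tim/output s3://mybucket/output

# Option 2: skip the staging step and point the job at S3 directly:
hadoop jar myjob.jar MyIndexJob s3://mybucket/input s3://mybucket/output
```

Staging via distcp lets the map tasks read data-local HDFS blocks during the job, at the cost of the up-front copy.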

Hope that helps,


On Sat, Jun 28, 2008 at 10:24 AM, tim robertson
<timrobertson100@gmail.com> wrote:
> Hi all,
> I have data in a file (150 million lines, around 100 GB) and have several
> MapReduce classes for my processing (custom index generation).
> Can someone please confirm the following is the best way to run on EC2 and
> S3 (both of which I am new to..)
> 1) load my 100 GB file into S3
> 2) create a class that will load the file from S3 and use as input to
> mapreduce (S3 not used during processing) and save output back to S3
> 3) create an AMI with the Hadoop + dependencies and my Jar file (loading the
> S3 input and the MR code) - I will base this on the public Hadoop AMI I
> guess
> 4) run using the standard scripts
> Is this best practice?
> I assume this is pretty common... is there a better way where I can submit
> my Jar at runtime and just pass in the URL for the input and output files in
> S3?
> If not, has anyone an example that takes input from S3 and writes output to
> S3 also?
> Thanks for advice, or suggestions of best way to run.
> Tim
