hadoop-common-user mailing list archives

From "tim robertson" <timrobertson...@gmail.com>
Subject Hadoop on EC2 + S3 - best practice?
Date Sat, 28 Jun 2008 09:24:54 GMT
Hi all,
I have data in a file (150 million lines, roughly 100 GB) and several
MapReduce classes for my processing (custom index generation).

Can someone please confirm that the following is the best way to run on EC2
and S3 (I am new to both):

1) load my 100 GB file into S3
2) create a class that loads the file from S3, uses it as input to the
MapReduce job (S3 is not used during processing), and saves the output back
to S3 - see the sketch after this list
3) create an AMI with Hadoop + dependencies and my jar file (containing the
S3 input loading and the MR code) - I will base this on the public Hadoop AMI
4) run using the standard scripts
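
To make step 2 concrete, here is roughly the driver I have in mind. It is
only a sketch: IdentityMapper/IdentityReducer are stand-ins for my index
generation classes, the credential values and bucket names are placeholders,
and it assumes a Hadoop build that ships the native S3 filesystem (s3n://);
older releases only have the block-based s3:// scheme, which cannot read a
plain file uploaded straight into a bucket.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class S3JobDriver {

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(S3JobDriver.class);
      conf.setJobName("s3-index-sketch");

      // AWS credentials for the native S3 filesystem (s3n://). These could
      // instead go in hadoop-site.xml so they are baked into the AMI.
      conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");     // placeholder
      conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY"); // placeholder

      // Identity classes as stand-ins; the real job would plug in the
      // custom index-generation mapper/reducer here.
      conf.setMapperClass(IdentityMapper.class);
      conf.setReducerClass(IdentityReducer.class);
      conf.setOutputKeyClass(LongWritable.class);
      conf.setOutputValueClass(Text.class);

      // Read directly from S3 and write back to S3; intermediate map output
      // still goes to the cluster's local disks, not S3.
      FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. s3n://mybucket/input
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // e.g. s3n://mybucket/output

      JobClient.runJob(conf);
    }
  }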

Is this best practice?
I assume this is pretty common... is there a better way where I can submit
my jar at runtime and just pass in the URLs for the input and output files in
S3?
If not, does anyone have an example that takes input from S3 and writes the
output back to S3?
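
To be concrete about the runtime submission: if the driver reads its paths
from the command line as in the sketch above, I imagine I could just run
something like

bin/hadoop jar myindexer.jar S3JobDriver s3n://mybucket/input s3n://mybucket/output

against an already-running cluster (the jar, class, and bucket names are all
made up).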

Thanks for any advice or suggestions on the best way to run this.

