hadoop-common-user mailing list archives

From Edmund Kohlwey <ekohl...@gmail.com>
Subject Re: Automate EC2 cluster termination
Date Tue, 10 Nov 2009 14:36:48 GMT
You should be able to detect the status of the job in your Java main()
method. Either call job.waitForCompletion() and, when the job finishes
running, check job.isSuccessful(); or, if you want, write a custom
"watcher" thread that polls job status manually, which lets you, for
instance, launch several jobs and wait for them all to return. Either
way you'll be polling the job tracker, but I think the overhead is
pretty minimal.
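A minimal sketch of the first approach (this assumes the 0.20-style
org.apache.hadoop.mapreduce.Job API that waitForCompletion() and
isSuccessful() come from; the class name, job name, and paths are
placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class JobRunner {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "my-job");   // hypothetical job name
        job.setJarByClass(JobRunner.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion(true) blocks until the job ends, polling the
        // job tracker under the hood and printing progress as it goes.
        job.waitForCompletion(true);

        // The exit status tells the calling shell script whether it is
        // safe to copy results out and terminate the cluster.
        System.exit(job.isSuccessful() ? 0 : 1);
      }
    }

For several jobs at once you could instead call job.submit() on each and
loop over job.isComplete() yourself before checking isSuccessful().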

By the way, I'm not sure it's necessary to copy data from S3 to DFS
(unless you have a performance reason to do so; even then, since you're
not really guaranteed much locality on EC2, you probably won't see a
huge difference). You could probably just set the default file system
to S3. See http://wiki.apache.org/hadoop/AmazonS3 .
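For example, something along these lines in your Hadoop config should do
it (a sketch only; "my-bucket" and the credential values are
placeholders, and that wiki page explains the difference between the
s3:// block store and the s3n:// native scheme):

    <!-- Make S3 the default file system so job input/output paths
         resolve against the bucket directly. -->
    <property>
      <name>fs.default.name</name>
      <value>s3n://my-bucket</value>
    </property>
    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY_ID</value>
    </property>
    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>YOUR_SECRET_ACCESS_KEY</value>
    </property>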


On 11/10/09 9:13 AM, John Clarke wrote:
> Hi,
>
> I use EC2 to run my Hadoop jobs using Cloudera's 0.18.3 AMI. It works great
> but I want to automate it a bit more.
>
> I want to be able to:
> - start cluster
> - copy data from S3 to the DFS
> - run the job
> - copy result data from DFS to S3
> - verify it all copied ok
> - shutdown the cluster.
>
>
> I guess the hardest part is reliably detecting when a job is complete. I've
> seen solutions that do a time-based shutdown, but they are not suitable as
> our jobs vary in duration.
>
> Has anyone made a script that does this already? I'm using the Cloudera
> Python scripts to start/terminate my cluster.
>
> Thanks,
> John
>

