hadoop-common-user mailing list archives

From Chris K Wensel <ch...@wensel.net>
Subject Re: Auto-shutdown for EC2 clusters
Date Fri, 24 Oct 2008 23:53:11 GMT

FYI, the src/contrib/ec2 scripts do just what Paco suggests,

minus the static IP stuff (you can use the scripts to log in via
cluster name, and spawn a tunnel for browsing nodes).

That is, you can spawn any number of uniquely named, configured, and
sized clusters, and you can increase their size independently as well
(shrinking is another matter altogether).
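
The bookkeeping Paco describes below -- one security group per cluster, then query the running instances and bucket them by group -- can be sketched in a few lines. This is not from the src/contrib/ec2 scripts; it is a minimal illustration that assumes you have already parsed instance descriptions (e.g. from `ec2-describe-instances` output) into records, and the field names `id`, `group`, and `state` are made up for the example.

```python
from collections import defaultdict

def clusters_by_security_group(instances):
    """Group running instance records by security group name.

    `instances` is assumed to be a list of dicts with illustrative
    keys 'id', 'group', and 'state' -- not the exact fields emitted
    by any particular EC2 tool.
    """
    clusters = defaultdict(list)
    for inst in instances:
        if inst["state"] == "running":
            clusters[inst["group"]].append(inst["id"])
    return dict(clusters)

# Two clusters, each named by its security group; one dead node
# is excluded from the count.
instances = [
    {"id": "i-001", "group": "hadoop-alpha", "state": "running"},
    {"id": "i-002", "group": "hadoop-alpha", "state": "running"},
    {"id": "i-003", "group": "hadoop-beta",  "state": "running"},
    {"id": "i-004", "group": "hadoop-beta",  "state": "terminated"},
]
print(clusters_by_security_group(instances))
# {'hadoop-alpha': ['i-001', 'i-002'], 'hadoop-beta': ['i-003']}
```

Because each cluster owns its group, counting the members of a group tells you how many of its slaves actually came up, independent of any other cluster sharing the same AMI and keypair.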


On Oct 24, 2008, at 1:58 PM, Paco NATHAN wrote:

> Hi Karl,
>
> Rather than using separate key pairs, you can use EC2 security groups
> to keep track of different clusters. Effectively, that requires a new
> security group for every cluster -- so just allocate a bunch of
> different ones in a config file, then have the launch scripts draw
> from those. We also use EC2 static IP addresses and then have a DNS
> entry named similarly to each security group, associated with a
> static IP once that cluster is launched. It's relatively simple to
> query the running instances and collect them according to security
> groups.
>
> One way to handle detecting failures is just to attempt SSH in a loop.
> Our rough estimate is that approximately 2% of the attempted EC2 nodes
> fail at launch. So we allocate more than enough, given that rate.
>
> In a nutshell, that's one approach for managing a Hadoop cluster
> remotely on EC2.
>
> Best,
> Paco
> On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson <kra@monkey.org> wrote:
>> On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:
>>> This workflow could be initiated from a crontab -- totally
>>> automated. However, we still see occasional failures of the
>>> cluster, and must restart manually, but not often. Stability for
>>> that has improved much since the 0.18 release. For us, it's
>>> getting closer to total automation.
>>>
>>> FWIW, that's running on EC2 m1.xl instances.
>>
>> Same here. I've always had the namenode and web interface be
>> accessible, but sometimes I don't get the slave nodes -- usually
>> zero slaves when this happens, sometimes I only miss one or two.
>> My rough estimate is that this happens 1% of the time.
>>
>> I currently have to notice this and restart manually. Do you have a
>> good way to detect it? I have several Hadoop clusters running at
>> once with the same AWS image and SSH keypair, so I can't count
>> running instances. I could have a separate keypair per cluster and
>> count instances with that keypair, but I'd like to be able to start
>> clusters opportunistically, with more than one cluster doing the
>> same kind of job on different data.
>>
>> Karl Anderson
>> kra@monkey.org
>> http://monkey.org/~kra
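
Paco's two tactics above -- probe each node with SSH in a loop, and overprovision against the ~2% launch failure rate -- could be sketched like this. This is an illustration, not code from the thread: `check_ssh` is a hypothetical caller-supplied probe (e.g. wrapping `ssh -o ConnectTimeout=5 <node> true`), and the function names are made up.

```python
import math
import time

def wait_for_cluster(nodes, check_ssh, attempts=5, delay=30.0):
    """Return the set of nodes that never answered the SSH probe.

    `check_ssh(node)` is a hypothetical hook returning True once the
    node accepts a connection; each round retries only the stragglers.
    """
    pending = set(nodes)
    for _ in range(attempts):
        pending = {n for n in pending if not check_ssh(n)}
        if not pending:
            break
        time.sleep(delay)
    return pending

def nodes_to_request(wanted, failure_rate=0.02):
    """Overprovision: request enough instances that, with roughly
    `failure_rate` of launches failing, `wanted` healthy nodes are
    still expected."""
    return math.ceil(wanted / (1.0 - failure_rate))
```

For example, for a 100-node cluster at a 2% failure rate, `nodes_to_request(100)` asks for 103 instances; any node still in the set returned by `wait_for_cluster` after the last round is treated as failed and a restart (or replacement launch) is triggered.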

Chris K Wensel
