hadoop-common-user mailing list archives

From "Tom White" <tom.e.wh...@gmail.com>
Subject Re: Auto-shutdown for EC2 clusters
Date Wed, 26 Nov 2008 17:41:43 GMT
I've just created a basic script to do something similar for running a
benchmark on EC2. See
https://issues.apache.org/jira/browse/HADOOP-4382. As it stands, the
code for detecting when Hadoop is ready to accept jobs is simplistic,
to say the least, so any ideas for improvement would be great.
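For anyone who wants to experiment, here is a minimal sketch of that kind of
readiness check (this is not the HADOOP-4382 code; the hostname below is a
placeholder, and port 50030 is the JobTracker web UI default in this era of
Hadoop -- check your hadoop-site.xml):

    import socket
    import time

    # Placeholder address: the master node's public DNS name and the
    # JobTracker web UI port.
    JOBTRACKER_HOST = "ec2-xx-xx-xx-xx.compute-1.amazonaws.com"
    JOBTRACKER_PORT = 50030

    def wait_until_ready(host, port, timeout=600, interval=10):
        """Poll until something answers on the JobTracker port, or give up."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                # A TCP connect only proves the daemon is listening; it does
                # not prove the cluster can actually run jobs.
                socket.create_connection((host, port), timeout=5).close()
                return True
            except OSError:
                time.sleep(interval)
        return False

    if wait_until_ready(JOBTRACKER_HOST, JOBTRACKER_PORT):
        print("JobTracker is up; safe to submit the benchmark")
    else:
        print("gave up waiting for the JobTracker")

A stricter check would submit a trivial job and wait for it to complete,
which also exercises the TaskTrackers.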


On Fri, Oct 24, 2008 at 11:53 PM, Chris K Wensel <chris@wensel.net> wrote:
> FYI, the src/contrib/ec2 scripts do just what Paco suggests,
> minus the static IP stuff (you can use the scripts to log in via cluster
> name, and spawn a tunnel for browsing nodes).
> That is, you can spawn any number of uniquely named, configured, and sized
> clusters, and you can increase their size independently as well (shrinking
> is another matter altogether).
> ckw
> On Oct 24, 2008, at 1:58 PM, Paco NATHAN wrote:
>> Hi Karl,
>> Rather than using separate key pairs, you can use EC2 security groups
>> to keep track of different clusters.
>> Effectively, that requires a new security group for every cluster --
>> so just pre-allocate a pool of them in a config file and have the
>> launch scripts draw from it. We also use EC2 static IP addresses and
>> create a DNS entry named after each security group, associated with
>> its static IP once that cluster is launched.
>> It's relatively simple to query the running instances and collect them
>> according to security groups.
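As a sketch of that query, using the present-day boto3 SDK (which postdates
this thread; the EC2 API tools of the era expose the same data), with the
region and credential handling left as assumptions:

    from collections import defaultdict
    import boto3  # modern AWS SDK, used here purely for illustration

    ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

    # Group running instances by security group; with one group per cluster,
    # the group name identifies the cluster.
    clusters = defaultdict(list)
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            if instance["State"]["Name"] != "running":
                continue
            for group in instance["SecurityGroups"]:
                clusters[group["GroupName"]].append(instance["InstanceId"])

    for cluster, ids in sorted(clusters.items()):
        print("%s: %d running instance(s)" % (cluster, len(ids)))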
>> One way to detect failures is simply to attempt SSH in a loop.
>> We estimate that roughly 2% of attempted EC2 nodes fail at launch,
>> so we allocate more than enough to cover that rate.
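A rough sketch of that SSH probe; the hostnames are hypothetical and would
come from the instance query above:

    import subprocess

    def node_is_up(host, retries=3):
        """Return True if we can SSH in and run a trivial command."""
        for _ in range(retries):
            result = subprocess.run(
                ["ssh", "-o", "ConnectTimeout=10",
                 "-o", "StrictHostKeyChecking=no", host, "true"],
                capture_output=True,
            )
            if result.returncode == 0:
                return True
        return False

    # Hypothetical names: want 50 workers, launch 52 to absorb the ~2%
    # launch-failure rate, then keep only the nodes that answer.
    requested = ["node%d.example.com" % i for i in range(52)]
    live = [h for h in requested if node_is_up(h)]
    print("%d of %d nodes reachable" % (len(live), len(requested)))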
>> In a nutshell, that's one approach for managing a Hadoop cluster
>> remotely on EC2.
>> Best,
>> Paco
>> On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson <kra@monkey.org> wrote:
>>> On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:
>> This workflow could be initiated from a crontab -- totally automated.
>> However, we still see occasional cluster failures and must restart
>> manually, though not often. Stability has improved considerably since
>> the 0.18 release. For us, it's getting closer to total automation.
>> FWIW, that's running on EC2 m1.xl instances.
>>> Same here. The namenode and web interface have always been accessible,
>>> but sometimes I don't get the slave nodes -- usually zero slaves when
>>> this happens, sometimes only one or two missing. My rough estimate is
>>> that this happens 1% of the time.
>>> I currently have to notice this and restart manually. Do you have a
>>> good way to detect it? I have several Hadoop clusters running at once
>>> with the same AWS image and SSH keypair, so I can't count running
>>> instances. I could have a separate keypair per cluster and count
>>> instances with that keypair, but I'd like to be able to start clusters
>>> opportunistically, with more than one cluster doing the same kind of
>>> job on different data.
>>> Karl Anderson
>>> kra@monkey.org
>>> http://monkey.org/~kra
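To Karl's question about detecting the missing-slave case, one sketch, under
assumptions (the expected slave hostnames come from your own launch records
or slaves file, jps from the JDK is on each node's PATH, and the process name
TaskTracker matches this version of Hadoop): SSH to each expected slave and
confirm a TaskTracker JVM is actually running.

    import subprocess

    def tasktracker_running(host):
        """SSH to a slave and look for a TaskTracker JVM via jps (a JDK tool)."""
        result = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=10", host, "jps"],
            capture_output=True, text=True,
        )
        return result.returncode == 0 and "TaskTracker" in result.stdout

    # Hypothetical slave list; in practice, read it from your launch records
    # or the cluster's slaves file.
    slaves = ["slave1.example.com", "slave2.example.com"]
    missing = [h for h in slaves if not tasktracker_running(h)]
    if missing:
        print("restart needed; no TaskTracker on: %s" % ", ".join(missing))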
> --
> Chris K Wensel
> chris@wensel.net
> http://chris.wensel.net/
> http://www.cascading.org/
