hadoop-common-user mailing list archives

From Joe Stein <charmal...@allthingshadoop.com>
Subject Re: Best practices - Large Hadoop Cluster
Date Wed, 11 Aug 2010 14:43:48 GMT
Not sure if this was mentioned already, but Adobe open-sourced their Puppet
implementation: http://github.com/hstack/puppet
There is also a nice post about it: http://hstack.org/hstack-automated-deployment-using-puppet/

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/

On Aug 11, 2010, at 7:40 AM, Steve Loughran <stevel@apache.org> wrote:

> On 10/08/10 21:06, Raj V wrote:
>> Mike
>> 512 nodes, even a minute for each node (ssh-ing to each node, typing an
>> 8-character password, ensuring that everything looks ok) is about 8.5 hours.
>> After that, if something does not work, that is a different level of pain
>> altogether.
>> 
>> Using scp to exchange keys simply does not scale.
>> 
>> My question was simple: how do other people in the group who run large
>> clusters manage this? Brian put it better: what is the best, repeatable way
>> of running Hadoop when the cluster is large? I agree, this is not a Hadoop
>> question per se, but Hadoop is really what I care about now.
>> 
> 
> 
> SSH is great, but you still shouldn't be playing around trying to do things
> by hand; even those parallel SSH tools break the moment you have a hint of
> inconsistency between machines.
> 
> 
> Instead, the general practice in managing *any large datacentre-scale
> application*, be it Hadoop or not, is to automate things so the machines do
> the work themselves, leaving sysadmins to deal with important issues like why
> all packets are being routed via Singapore or whether the HDD failure rate is
> statistically significant.
> 
> The standard techniques are usually one of:
> 
> * Build your own RPMs and deb files, push stuff out with kickstart, change a
> machine by rebuilding its root disk.
> Strengths: good for clean builds
> Weaknesses: a lot of work, doesn't do recovery
> 
> * Model-driven tools. I know most people now say "yes, Puppet", but cfEngine
> and bcfg2 have been around for a while; SmartFrog is what we use. In these
> tools you specify what you want, and they keep an eye on things and push the
> machines back into the desired state.
> Strengths: recovers from bad state, keeps the machines close to the desired state
> Weaknesses: if the desired state is not consistent, they tend to circle
> between the various unreachable states.
> 
> * Scripts. People end up doing this without thinking.
> Strengths: take your commands and script them; strong ordering of operations
> Weaknesses: bad at recovery.
> 
> * VM images, maintained by hand or by another technique.
> Strengths: OK if you have one gold image that can be pushed out every time a
> VM is created, and the VMs are short-lived.
> Weaknesses: Unless your VMs are short-lived, you've just created a maintenance
> nightmare worse than before.
> 
> 
> Hadoop itself is not too bad at handling failures of individual machines, but
> the general best practices in large cluster management (look at the LISA
> proceedings) are pretty much foundational.
> 
> http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment
> 
> -Steve
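
To make the model-driven approach Steve describes a bit more concrete, here is
a toy convergence loop in Python. This is only an illustrative sketch of the
"declare desired state, detect drift, correct it" idea; it is not how Puppet,
cfEngine, bcfg2 or SmartFrog actually work, and the paths and file contents
below are made up for the example.

    #!/usr/bin/env python
    """Toy 'desired state' convergence loop.

    Declares the files a node should have, then repeatedly compares the
    declared state with what is on disk and corrects any drift. All paths
    and contents here are hypothetical.
    """
    import os
    import time

    # Hypothetical desired state: path -> exact content it should contain.
    DESIRED_FILES = {
        "/tmp/demo-hadoop/conf/slaves": "node001\nnode002\nnode003\n",
    }

    def ensure_file(path, content):
        """Make sure `path` exists with exactly `content`; report drift."""
        current = None
        if os.path.exists(path):
            with open(path) as fh:
                current = fh.read()
        if current != content:
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as fh:
                fh.write(content)
            print("corrected drift in %s" % path)
        else:
            print("%s already in desired state" % path)

    def converge(interval=60):
        """Check, correct, sleep, repeat: the loop a model-driven agent runs."""
        while True:
            for path, content in DESIRED_FILES.items():
                ensure_file(path, content)
            time.sleep(interval)

    if __name__ == "__main__":
        # One pass is enough for a demo; a real agent would call converge().
        for path, content in DESIRED_FILES.items():
            ensure_file(path, content)

The point of the sketch is the shape of the loop: the "model" is data, the
agent is idempotent, and re-running it converges the machine toward the
declared state rather than re-executing a one-shot script, which is why this
style recovers from bad state where plain scripts do not.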
