hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Best practices - Large Hadoop Cluster
Date Wed, 11 Aug 2010 11:40:26 GMT
On 10/08/10 21:06, Raj V wrote:
> Mike
> 512 nodes, even a minute for each node (ssh-ing to each node, typing an 8
> character password, ensuring that everything looks ok) is about 8.5 hours. After
> that, if something does not work, that is a different level of pain altogether.
> Using scp to exchange keys simply does not scale.
> My question was simple: how do other people in the group who run large clusters
> manage this? Brian put it better: what is the best, repeatable way of
> running hadoop when the cluster is large. I agree, this is not a hadoop
> question per se, but hadoop is really what I care about now.

SSH is great, but you still shouldn't be trying to do things by hand; 
even the parallel SSH tools break the moment you have a hint of 
inconsistency between machines.

Instead, the general practice in managing *any large datacentre-scale 
application*, be it hadoop or not, is to automate things so the machines 
do the work themselves, leaving the sysadmins free to deal with 
important issues like why all packets are being routed via Singapore, or 
whether the HDD failure rate is statistically significant.

The standard techniques are usually one of:

  * Build your own RPMs or .deb files, push them out with kickstart, and 
change a machine by rebuilding its root disk.
  Strengths: good for clean builds.
  Weaknesses: a lot of work, and it doesn't handle recovery.

  * Model-driven tools. I know most people now say "yes, Puppet", but 
actually CFEngine and Bcfg2 have been around for a while; SmartFrog is 
what we use. In these tools, you specify what you want, and they keep an 
eye on things and push the machines back into the desired state.
  Strengths: recovers from a bad state, keeps the machines close to the 
desired state.
  Weaknesses: if the desired state is not consistent, they tend to 
circle between the various unreachable states.
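The convergence behaviour these tools share can be illustrated with a toy reconciliation loop. This is a minimal sketch, not any tool's real API; the resource names and version strings are made up for the example:

```python
# Toy model-driven reconciliation: compare the desired state against the
# actual state and report only the resources that need changing, so that
# repeated runs are idempotent -- a run against a converged machine is a no-op.

def reconcile(desired: dict, actual: dict) -> dict:
    """Return the actions needed to move `actual` toward `desired`."""
    actions = {}
    for resource, want in desired.items():
        if actual.get(resource) != want:
            actions[resource] = want
    return actions

# Hypothetical resources: a package version and a service state.
desired = {"pkg:hadoop": "0.20.2", "svc:datanode": "running"}
actual = {"pkg:hadoop": "0.20.1", "svc:datanode": "running"}

print(reconcile(desired, actual))  # only the out-of-date package needs work
```

The weakness above falls out of the same loop: if two parts of the desired state contradict each other, each run "fixes" one resource by breaking the other, and the machine circles forever.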

  * Scripts. People end up doing this without thinking.
  Strengths: you take the commands you'd type and script them, with a 
strong ordering of operations.
  Weaknesses: bad at recovery.
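For illustration, the fan-out script people end up writing looks something like the sketch below. The hostnames and the `dry_run` switch are invented for the example, and a real run would need passwordless SSH already working on every node, which is exactly the bootstrapping problem raised above:

```python
# Naive fan-out: run one command on every node, in order. Easy to write,
# but if it dies halfway through, the cluster is left in a mixed state
# and the script has no notion of recovery or convergence.
import subprocess

NODES = ["node%03d" % i for i in range(1, 513)]  # hypothetical host names

def run_everywhere(command, dry_run=True):
    results = {}
    for node in NODES:
        argv = ["ssh", node, command]
        if dry_run:
            results[node] = " ".join(argv)  # just record what would run
        else:
            results[node] = subprocess.call(argv)  # blocks per node
    return results

plan = run_everywhere("yum -y install hadoop")
print(plan["node001"])  # prints: ssh node001 yum -y install hadoop
```

At 512 nodes even the serial loop is slow, and parallelising it only amplifies the recovery problem when a few nodes fail.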

  * VM images, maintained by hand or by another technique.
  Strengths: OK if you have one gold image that can be pushed out every 
time a VM is created -and the VMs are short-lived.
  Weaknesses: unless your VMs are short-lived, you've just created a 
maintenance nightmare worse than before.
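The "short-lived" caveat can be made concrete with a toy drift check: stamp each VM with the image it was created from, and when the gold image moves on, recreate stale VMs rather than patching them in place. All names and versions here are illustrative only:

```python
# Toy drift check for the gold-image approach: a VM whose recorded image
# version no longer matches the current gold image should be destroyed
# and recreated, not patched by hand.

GOLD_IMAGE = "hadoop-worker-2010-08-01"  # hypothetical current gold image

def needs_rebuild(vm_image_version: str) -> bool:
    return vm_image_version != GOLD_IMAGE

# A long-lived VM (vm-b) has drifted two gold-image releases behind.
fleet = {
    "vm-a": "hadoop-worker-2010-08-01",
    "vm-b": "hadoop-worker-2010-06-15",
}
stale = [vm for vm, image in fleet.items() if needs_rebuild(image)]
print(stale)  # prints: ['vm-b']
```

With short-lived VMs the stale list stays empty by construction; with long-lived ones, every entry in it is a machine someone has to patch by hand.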

Hadoop itself is not too bad at handling failures of individual 
machines, but the general best practices of large-cluster management 
(look at the LISA proceedings) are pretty much foundational.


