Subject: Re: Best practices - Large Hadoop Cluster
From: Joe Stein
Date: Wed, 11 Aug 2010 10:43:48 -0400
To: common-user@hadoop.apache.org

Not sure whether this was mentioned already, but Adobe open-sourced their Puppet impl, http://github.com/hstack/puppet, along with a nice post about it: http://hstack.org/hstack-automated-deployment-using-puppet/

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/
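For anyone who hasn't looked at Puppet yet, the declarative style in that post looks roughly like this. A minimal, hypothetical manifest sketch (the package, path, and service names here are made up, not taken from the hstack modules); the agent on each node keeps converging the node toward this state:

    # Hypothetical sketch of desired state for one Hadoop worker node.
    # Names are illustrative; see the hstack modules for a real layout.
    package { 'hadoop':
      ensure => installed,
    }

    file { '/etc/hadoop/conf/core-site.xml':
      ensure  => file,
      source  => 'puppet:///modules/hadoop/core-site.xml',
      require => Package['hadoop'],
    }

    service { 'hadoop-datanode':
      ensure    => running,
      enable    => true,
      subscribe => File['/etc/hadoop/conf/core-site.xml'],
    }

The subscribe relationship gives you config-change-triggers-restart for free, which is exactly the kind of thing that is painful to script by hand across hundreds of nodes.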
On Aug 11, 2010, at 7:40 AM, Steve Loughran wrote:

> On 10/08/10 21:06, Raj V wrote:
>> Mike,
>> 512 nodes, at even a minute per node (ssh-ing to the node, typing an 8-character password, ensuring that everything looks OK), is about 8.5 hours. After that, if something does not work, that is a different level of pain altogether.
>>
>> Using scp to exchange keys simply does not scale.
>>
>> My question was simple: how do other people in the group who run large clusters manage this? Brian put it better: what is the best, duplicatable way of running Hadoop when the cluster is large? I agree, this is not a Hadoop question per se, but Hadoop is really what I care about now.
>
> SSH is great, but you still shouldn't be playing around trying to do things by hand; even the parallel SSH tools break the moment you have a hint of inconsistency between machines.
>
> Instead, the general practice in managing *any large datacentre-scale application*, be it Hadoop or not, is to automate things so the machines do the work themselves, leaving sysadmins to deal with important issues like why all packets are being routed via Singapore, or whether the HDD failure rate is statistically significant.
>
> The standard techniques are usually one of:
>
> * Build your own RPMs or .deb files, push things out with Kickstart, change a machine by rebuilding its root disk.
>   Strengths: good for clean builds.
>   Weaknesses: a lot of work; doesn't do recovery.
>
> * Model-driven tools. I know most people now say "yes, Puppet", but actually CFEngine and Bcfg2 have been around for a while; SmartFrog is what we use. In these tools you specify what you want, and they keep an eye on things and push the machines back into the desired state.
>   Strengths: recovers from bad state; keeps the machines close to the desired state.
>   Weaknesses: if the desired state is not consistent, they tend to cycle between the various unreachable states.
>
> * Scripts. People end up doing this without thinking.
>   Strengths: take your commands and script them; strong ordering of operations.
>   Weaknesses: bad at recovery.
>
> * VM images, maintained by hand or by another technique.
>   Strengths: OK if you have one gold image that can be pushed out every time a VM is created, and the VMs are short-lived.
>   Weaknesses: unless your VMs are short-lived, you've just created a maintenance nightmare worse than before.
>
> Hadoop itself is not too bad at handling failures of individual machines, but the general best practices in large cluster management (look at the LISA proceedings) are pretty much foundational.
>
> http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment
>
> -Steve
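To make the model-driven idea concrete: in a Puppet-style tool, even the ssh key exchange Raj describes becomes declared state rather than 512 scp sessions. A minimal, hypothetical sketch (the user name and key material are made up; the hstack modules linked above show a real layout):

    # Hypothetical sketch: declare the master's public key as desired
    # state on every node, instead of scp-ing it to each machine by hand.
    user { 'hadoop':
      ensure => present,
    }

    ssh_authorized_key { 'hadoop@master':
      ensure  => present,
      user    => 'hadoop',
      type    => 'ssh-rsa',
      key     => 'AAAAB3...',  # public key material, elided
      require => User['hadoop'],
    }

The agent applies this on every node, and a node that drifts (key removed, user deleted) is pushed back to the declared state on the next run, which is the recovery property Steve calls out above.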