hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: automatic node discovery
Date Tue, 18 Oct 2011 10:43:04 GMT
On 18/10/11 10:48, Petru Dimulescu wrote:
> Hello,
> I wonder how do you guys see the problem of automatic node discovery:
> having, for instance, a couple of hadoops, with no configuration
> explicitly set whatsoever, simply discover each other and work together,
> like Gridgain does: just fire up two instances of the product, on the
> same machine or on different machines in the same LAN, they will use
> multicast or whatever to discover each other

You can use techniques like Bonjour to have Hadoop services register
themselves in DNS and be located that way, but the workers only need to
discover the NameNode (NN) and JobTracker (JT) and report in.
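
To make "discover the NN and JT" concrete, here is a minimal sketch in Python of how little a worker needs once the two masters have fixed DNS names. The helper name, the hostnames and the port numbers are all hypothetical, picked only for illustration; the two property names are the standard Hadoop 1.x ones.

```python
# Hypothetical helper: derive the two settings a worker needs from two
# fixed DNS names. Hostnames and ports are illustrative assumptions,
# not values taken from this thread.

def master_settings(nn_host, jt_host, nn_port=8020, jt_port=8021):
    """Return the properties that point a worker at the NN and the JT."""
    return {
        "fs.default.name": "hdfs://%s:%d" % (nn_host, nn_port),
        "mapred.job.tracker": "%s:%d" % (jt_host, jt_port),
    }

settings = master_settings("namenode.example.internal",
                           "jobtracker.example.internal")
for key in sorted(settings):
    print("%s=%s" % (key, settings[key]))
```

Everything else (block reports, heartbeats, task assignment) follows from the workers reporting in to those two addresses.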

> and to be a part of a
> self-discovered topology.

Topology inference is an interesting problem. Something purely for 
diagnostics could be useful.

> Of course, if you have special network requirements you should be able
> to specify undiscoverable nodes by IP or name but often grids are
> installed on LANs and it should really be simpler.

In a production system I'd have a private switch and isolate things for 
bandwidth and security; this is why auto configuration is generally 
neglected. If it were to be added, it would go via ZooKeeper, leaving
only the ZooKeeper discovery problem. You can't rely on DNS or multicast
IP here as they don't always work in virtualised environments.

> Namenodes are a bit different, they should use safer machines, I'm
> basically talking about datanodes here, but still I wonder how hard can
> it be to have self-assigned namenodes, maybe replicated automatically on
> several machines, unless one specific namenode is explicitly set via xml
> configuration.

I wouldn't touch dynamic namenodes; you really need fixed NNs and
secondary namenodes (2NNs), and as automatic replication isn't there
it's a non-issue.

With fixed NN and JT entries in the DNS table, anything can come up in 
the LAN and talk to them unless you set up the master nodes with lists 
of things you trust.
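
As a rough illustration of the trusted-list idea (Hadoop exposes it through the include files named by `dfs.hosts` and `mapred.hosts`), the check on the master side amounts to something like the Python sketch below. The function names, the one-hostname-per-line parsing and the hostnames are assumptions made for the example; this is not Hadoop's own code.

```python
# Sketch of an allowlist check on a master node. The helper names and
# the example hostnames are hypothetical.

def load_trusted_hosts(text):
    """Parse an include-file body: one hostname per line, blanks skipped."""
    hosts = set()
    for line in text.splitlines():
        name = line.strip().lower()
        if name:
            hosts.add(name)
    return hosts

def is_trusted(host, trusted):
    """True if the reporting host appears in the trusted set."""
    return host.strip().lower() in trusted

trusted = load_trusted_hosts(
    "worker01.example.internal\nworker02.example.internal\n\n")

print(is_trusted("worker01.example.internal", trusted))
print(is_trusted("rogue.example.internal", trusted))
```

A node that is not in the list can still resolve the NN and JT names in DNS; it just gets turned away when it reports in.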

> Also, the ssh passwordless thing is so awkward. If you have a network of
> hadoop that mutually discover each other there is really no need for
> this passwordless ssh requirement. This is more of a system
> administrator aspect, if sysadmins want to automatically deploy or start
> a program on 5000 machines they often have the tools & skills to do that,
> it should not be a requirement.

It's not a requirement; there are other ways to deploy. Large clusters
tend to use cluster management tooling that keeps the OS images
consistent, or you can use more devops-centric tooling (including Apache
Whirr) to roll things out.
