hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: How do people keep their client configurations in sync with the remote cluster(s)
Date Thu, 15 May 2008 17:33:30 GMT

I use several strategies:

A) avoid dependency on Hadoop's configuration entirely by using HTTP access to
files.  I use this, for example, where we have a PHP or Grails or Oracle app
that needs to read a data file or three from HDFS (there is a sketch of this
below).

B) rsync early and often and lock down the config directory.

C) get a really good sysop who does (b) and shoots people who mess up.

D) (we don't do this yet) establish a configuration repository using
ZooKeeper, WebDAV, or a (horrors) NFS file system.  At the very least, I
would like to be able to get the namenode address and port that way.
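The ZooKeeper end of (d) would only be a few lines of client code.  Here is an
untested sketch against the ZooKeeper Java client; the ensemble address and
the znode layout are made up for illustration:

    import org.apache.zookeeper.ZooKeeper;

    public class NamenodeLookup {
      public static void main(String[] args) throws Exception {
        // Connect to the (hypothetical) configuration ensemble.  The second
        // argument is the session timeout in ms; no watcher is needed for a
        // one-shot read, so we pass null.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 10000, null);

        // The znode holds "host:port" for the production namenode, put
        // there once by whoever administers the cluster.
        byte[] data = zk.getData("/config/namenode", false, null);
        System.out.println("namenode = " + new String(data, "UTF-8"));
        zk.close();
      }
    }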

Mostly, our apps are either inside the cluster and covered by (b) and (c), or
entirely outside the cluster and covered by (a).  Many of our apps are pure
import or pure export.  The import side really only needs to know where the
namenode is, and the pure export side really only needs the HTTP access.  That
makes the configuration management task vastly easier.
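Concretely, both ends can be nearly configuration-free.  A sketch of the two
sides in Java follows; the hostnames, ports, and paths are invented, and
fs.default.name is the stock key for pointing a client at a namenode:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ImportExportSketch {
      public static void main(String[] args) throws Exception {
        // Import side: the only cluster-specific setting the client needs
        // is the namenode address; everything else can stay at the shipped
        // defaults.
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020/");
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/feeds/incoming/part-0"));
        out.write("imported record\n".getBytes("UTF-8"));
        out.close();

        // Export side: no Hadoop configuration at all, just an HTTP GET
        // against whatever URL the cluster exposes files at (made up here).
        URL url = new URL("http://namenode.example.com:50070/data/feeds/out/part-0");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        for (String line = in.readLine(); line != null; line = in.readLine()) {
          System.out.println(line);
        }
        in.close();
      }
    }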

Another serious (as in SERIOUS) problem is how you keep data-processing
elements in a QA or staging data chain from inserting bogus data into the
production data chain, while still having them work in production with minimal
reconfiguration on final deploy.  We don't have a particularly good solution
for that yet, but we are planning on using ZooKeeper's host-based permissions
to good effect there.

That should let us have data mirrors that shadow the production data feed
system, so that staged systems can process live data but cannot insert it back
into the production setting.  The mirror will have read-only access to the
feed metadata, the staging machines will have no access to the production feed
metadata at all, and these limitations will be imposed by a single
configuration on the ZooKeeper side rather than on each machine.  This should
allow us to keep things cleaner than they normally wind up.
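The permissions piece is what makes that single point of control possible:
ZooKeeper lets you attach ip-scheme ACLs to a znode when you create it, so one
list in the setup code decides which machines can write the feed metadata,
which can only read it, and which cannot touch it at all.  A sketch, again
with invented addresses and paths:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.ACL;
    import org.apache.zookeeper.data.Id;

    public class FeedAclSetup {
      public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 10000, null);

        List<ACL> acls = new ArrayList<ACL>();
        // Production feed machines get full access to the feed metadata.
        acls.add(new ACL(ZooDefs.Perms.ALL, new Id("ip", "10.1.0.0/16")));
        // Mirror machines can read the metadata but never write it.
        acls.add(new ACL(ZooDefs.Perms.READ, new Id("ip", "10.2.0.0/16")));
        // Staging machines are simply absent from the list, so they get
        // no access to the production feed metadata at all.

        zk.create("/feeds/meta", new byte[0], acls, CreateMode.PERSISTENT);
        zk.close();
      }
    }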

But the short answer is that this is a hard problem to get really, really
right.  


On 5/15/08 5:05 AM, "Steve Loughran" <stevel@apache.org> wrote:

> 
> I have a question for users: how do they ensure their client apps have
> configuration XML files that are kept up to date?
> 
> I know how I do it to date (get the site config off the site team, keep
> my private copy in SVN), but that is too brittle, and diagnosing
> failures is pretty tricky. All you get is "Failed to Submit Job!"
> exceptions and local stack traces, from which you have to work backwards
> to the underlying problem.
> 
> I'm thinking of looking at what it would take for a job submitter to ask
> the tracker for its config data, so as to get things like the various
> directory bases from the cluster instead of having them compiled into the
> client. Then the management problem becomes one of keeping the cluster
> configuration under control, which is a much easier proposition.
> 
> What do people do right now?
> 
> -steve

