hadoop-common-dev mailing list archives

From Andrew Purtell <apurt...@yahoo.com>
Subject Re: Automatic recovery Mechanism for namenode failure...
Date Wed, 23 Jul 2008 16:43:16 GMT
For our application we built service monitors that launch and monitor the
Hadoop (and HBase) daemons as subprocesses. The monitors publish the
service locations on a DHT that supports TTLs on values, with TTLs of 30s.
Should the subprocess die, the monitor also exits, and the DHT values then
expire. Redundant cluster monitor processes watch the DHT for service
failure and, via SSH command, can either restart a failed process or
reassign a service away from a failed node. By policy, if the namenode
fails, all of the datanodes are restarted after its reassignment/restart.
Any service reliant on DFS that failed during DFS unavailability would
also be restarted. The service monitors do service location discovery on
the DHT and write hadoop-site.xml and hbase-site.xml files accordingly, so
when dependent services restart they automatically pick up any location
changes.
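
To make the moving parts concrete, here is a rough sketch of the
monitor/expiry idea in Python. It is not our code, only an illustration:
TTLStore is a hypothetical in-memory stand-in for the DHT, and the names
run_monitor and failed_services, the 10s refresh interval, and the
"host:port" location format are all assumptions. Only the 30s TTL comes
from the description above.

```python
import subprocess
import time


class TTLStore:
    """Hypothetical in-memory stand-in for a DHT whose values expire."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._values = {}  # key -> (value, expiry time)

    def put(self, key, value, ttl):
        self._values[key] = (value, self._clock() + ttl)

    def get(self, key):
        entry = self._values.get(key)
        if entry is None or entry[1] <= self._clock():
            self._values.pop(key, None)  # treat expired entries as absent
            return None
        return entry[0]


def run_monitor(store, service, location, command, ttl=30.0, interval=10.0):
    """Launch `command` and republish its location until it exits.

    Returns the subprocess exit code. Nothing is deleted on exit: the
    entry simply stops being refreshed and expires after `ttl` seconds.
    """
    proc = subprocess.Popen(command)
    while proc.poll() is None:
        store.put(service, location, ttl)
        try:
            proc.wait(timeout=interval)
        except subprocess.TimeoutExpired:
            pass  # still running; loop around and refresh the entry
    return proc.returncode


def failed_services(store, expected):
    """Return the expected service names with no live entry in the store."""
    return [name for name in expected if store.get(name) is None]
```

A cluster monitor would call something like failed_services() on a timer
and issue the SSH restart or reassignment for each name it returns. The
refresh interval has to be comfortably shorter than the TTL, otherwise a
healthy service can briefly look dead.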

I suppose we could have done the above with Zookeeper instead of a DHT. 

I don't have any code that I can share, but the above took me less than a
week to accomplish, so I can say it is not difficult. 

Hope this helps, 

   - Andy

--- On Wed, 7/23/08, Pratyush Banerjee <pratyushbanerjee@aol.com> wrote:

> From: Pratyush Banerjee <pratyushbanerjee@aol.com>
> Subject: Automatic recovery Mechanism for namenode failure...
> To: core-dev@hadoop.apache.org
> Date: Wednesday, July 23, 2008, 1:10 AM
> Hi All,
> We have been using Hadoop 0.17.1 for a 50-machine cluster.
> Since we have continuous weblogs being written into the
> HDFS therein, we are concerned about the failure of the
> namenode. Digging into the Hadoop documentation, I found out
> that currently Hadoop does not support automatic recovery
> of the namenode.
> However, for our situation we intend to have a mechanism
> that will detect a namenode failure and automatically
> start up the namenode with the -importcheckpoint option on the
> secondary namenode server.
> When I say automatically, it necessarily means absolutely
> no manual intervention at the point of failure and startup.

