hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Question about fault tolerance and fail over for name nodes
Date Wed, 30 Jul 2008 10:38:52 GMT
Andreas Kostyrka wrote:
> On Tuesday 29 July 2008 18:22:07 Paco NATHAN wrote:
>> Jason,
>>
>> FWIW -- based on a daily batch process, requiring 9 Hadoop jobs in
>> sequence -- 100+2 EC2 nodes, 2 Tb data, 6 hrs run time.
>>
>> We tend to see a namenode failing early, e.g., the "problem advancing"
>> exception in the values iterator, particularly during a reduce phase.
>>
>> Hot-fail would be great. Otherwise, given the duration of our batch
>> job overall, we use what you describe: shut down cluster, etc.
>>
>> Would prefer to observe this kind of failure sooner than later. We've
>> discussed internally how to craft an initial job which could stress
>> test the namenode.  Think of a "unit test" for the cluster.
> 
> ssh namenode 'kill -9 $(ps ax | grep java.*NameNode | cut -f 1 -d " " )'
> 
> Here goes your namenode failure, if you just want to do the exercise for a 
> failover ;)

Simulating a network partition can be more interesting, because then your 
failover tools have to deal with the risk that two machines each think 
they are in charge (a "split-brain" state). This is why building 
high-availability and fault-tolerant systems is tricky.
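
To make the two-masters risk concrete, here is a minimal toy sketch in Python. It is not how Hadoop behaves, and it is not a real failover tool: the node names (nn1, nn2, witness), the Node class, and both failover functions are hypothetical, invented for illustration. It compares a naive rule ("promote the standby if it cannot reach the primary") with a majority rule ("a node may be active only if it can see more than half of the voters, itself included") under the same partition.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster member; 'reachable' holds peers it can still contact."""
    name: str
    active: bool = False
    reachable: set = field(default_factory=set)


def naive_failover(primary, standby):
    """Standby promotes itself whenever it cannot reach the primary."""
    if primary.name not in standby.reachable:
        standby.active = True


def quorum_failover(node, cluster_size):
    """Node stays or becomes active only if it sees a strict majority of voters."""
    visible = len(node.reachable) + 1  # peers it can reach, plus itself
    node.active = visible > cluster_size // 2


def active_names(nodes):
    """Sorted names of the nodes that currently believe they are in charge."""
    return sorted(n.name for n in nodes if n.active)


# Case 1: two-node pair, partitioned. Neither node can see the other.
a = Node("nn1", active=True)
b = Node("nn2")
naive_failover(a, b)
naive_result = active_names([a, b])

# Case 2: three voters; nn1 is cut off, nn2 and the witness still see each other.
c = Node("nn1", active=True)
d = Node("nn2", reachable={"witness"})
w = Node("witness", reachable={"nn2"})  # votes only, never becomes active
for n in (c, d):
    quorum_failover(n, cluster_size=3)
quorum_result = active_names([c, d])

print("naive :", naive_result)
print("quorum:", quorum_result)
```

In the naive case both nodes end up active, which is exactly the split-brain outcome. The majority rule leaves only one active node here, but only because the isolated node evaluates the rule and steps down by itself; a real system cannot assume that, and usually needs some form of fencing as well.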

-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/
