hadoop-general mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: reserch the code of hadoop
Date Thu, 08 Apr 2010 10:08:20 GMT
Zhanlei Ma wrote:
> Hi all:
>    We know about the fault tolerance of Hadoop when a datanode fails, while running Hadoop.

> However, which part of the source handles this situation? And how does it solve it?

The relevant code is in hadoop-hdfs; get on the hdfs-dev@hadoop.apache.org 
mailing list to discuss these problems.

Namenode: needs some way of feeding changes to peers in a (changing, 
volatile) set of namenodes such that any of them can fail over to become 
the primary; or you go for a full "any NN accepts operations" model, at 
which point you are in the domain of HA distributed databases, along 
with their performance problems
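To make the first option concrete, here is a minimal sketch of the idea, using hypothetical class names (`Primary`, `Standby`) rather than Hadoop's actual classes: the primary fans each edit-log entry out to its standbys as it writes it, so any standby holds the full log and can take over.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a primary namenode streams edit-log entries to a set of
// standbys so that any standby can replay the log and become primary.
// All names here are illustrative, not Hadoop's real API.
public class EditLogReplication {

    static class Standby {
        final List<String> editLog = new ArrayList<>();
        void apply(String edit) { editLog.add(edit); }
    }

    static class Primary {
        final List<String> editLog = new ArrayList<>();
        final List<Standby> standbys = new ArrayList<>();

        void logEdit(String edit) {
            editLog.add(edit);
            // Fan out synchronously so no standby lags the primary;
            // this is the cost of avoiding data loss on failover.
            for (Standby s : standbys) s.apply(edit);
        }
    }

    public static void main(String[] args) {
        Primary p = new Primary();
        Standby s1 = new Standby();
        p.standbys.add(s1);
        p.logEdit("MKDIR /user/alice");
        p.logEdit("CREATE /user/alice/data.txt");
        // The standby's log now equals the primary's, so on failure it
        // can replay it and serve as the new primary.
        System.out.println(s1.editLog.equals(p.editLog)); // prints "true"
    }
}
```

The synchronous fan-out is exactly where the performance tension mentioned above lives: every metadata write now waits on the slowest standby.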

datanode: rebinding on failure

dfsclient: move away from a single DNS URL and do some rebinding lookup
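A rough sketch of what client-side rebinding could look like, under the assumption (names are hypothetical) that the client is given a list of candidate namenode addresses instead of a single DNS name, and walks the list until one answers:

```java
import java.util.List;
import java.util.function.Function;

// Illustrative only: a client that is not pinned to one DNS URL but
// tries a list of candidate namenode addresses and rebinds to whichever
// one accepts the connection.
public class RebindingClient {

    // connector returns true if the address accepted the connection
    static String connect(List<String> candidates,
                          Function<String, Boolean> connector) {
        for (String addr : candidates) {
            if (connector.apply(addr)) {
                return addr; // rebind: remember the live namenode
            }
        }
        throw new IllegalStateException("no namenode reachable");
    }

    public static void main(String[] args) {
        List<String> nns = List.of("nn1.example.com:8020",
                                   "nn2.example.com:8020");
        // Simulated cluster state: nn1 is down, nn2 is up.
        String live = connect(nns, addr -> addr.startsWith("nn2"));
        System.out.println(live); // prints "nn2.example.com:8020"
    }
}
```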

This is all serious code and hard to test; everyone running production 
clusters would like HA, but not at the expense of performance *or any of 
their existing data*. You need to start talking to Yahoo! and Facebook 
before you begin touching the code to understand their needs, as the big 
datacentre teams will veto any change that doesn't work for them. The 
problems of small clusters, where small is a few hundred TB on tens of 
servers, are considered dealt with and it is the big facilities where 
the problems turn up. This is why the historical focus has been on rapid 
recovery and (ideally) zero data loss over continuous availability.
