Zhanlei Ma wrote:
> Hi all,
> We know Hadoop is fault tolerant when a datanode fails while Hadoop is running.
> However, which part of the source handles that situation? And how does it solve it?
The relevant code is in hadoop-hdfs; get on the hdfs-dev@hadoop.apache.org
mailing list to discuss these problems.
Namenode: needs some way of feeding changes to peers in a (changing,
volatile) set of namenodes such that any of them can fail over to become
the primary. The alternative is a complete "any NN accepts operations"
world, which puts you in the domain of HA distributed databases, along
with their performance problems.
Datanode: rebinding on failure.
DFSClient: move away from a single DNS URL and do some rebinding lookup.
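To make the client-side "rebinding lookup" idea concrete, here is a minimal sketch in Python. It is not Hadoop code and not how DFSClient works today: the names find_primary, RebindingClient and is_primary are hypothetical, and the probe that decides which namenode is primary is assumed to be supplied by the caller. It only covers the client choosing a new endpoint; it does nothing about replicating namenode state or avoiding split brain.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


class NoActiveNamenode(Exception):
    """Raised when no candidate answers as the primary."""


def find_primary(candidates: Iterable[str],
                 is_primary: Callable[[str], bool]) -> str:
    """Return the first candidate address that reports itself as primary.

    is_primary is a hypothetical probe: it returns True for the active
    namenode and may raise OSError when a host is unreachable.
    """
    for address in candidates:
        try:
            if is_primary(address):
                return address
        except OSError:
            # Unreachable host: treat it as failed and try the next one.
            continue
    raise NoActiveNamenode("no candidate answered as primary")


class RebindingClient:
    """Client that holds a list of candidates instead of one fixed address."""

    def __init__(self, candidates: Iterable[str],
                 is_primary: Callable[[str], bool]) -> None:
        self._candidates = list(candidates)
        self._is_primary = is_primary
        self._current: Optional[str] = None

    def call(self, operation: Callable[[str], T]) -> T:
        """Run operation(address); on a connection failure, rebind once and retry."""
        if self._current is None:
            self._current = find_primary(self._candidates, self._is_primary)
        try:
            return operation(self._current)
        except OSError:
            # The bound namenode went away: look up the new primary and retry.
            self._current = find_primary(self._candidates, self._is_primary)
            return operation(self._current)
```

Retrying blindly like this is only safe for idempotent operations, which is one small example of why this is hard to get right without risking existing data.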
This is all serious code and hard to test; everyone running production
clusters would like HA, but not at the expense of performance *or any of
their existing data*. You need to start talking to Yahoo! and Facebook
before you begin touching the code to understand their needs, as the big
datacentre teams will veto any change that doesn't work for them. The
problems of small clusters, where small is a few hundred TB on tens of
servers, are considered dealt with and it is the big facilities where
the problems turn up. This is why the historical focus has been on rapid
recovery and (ideally) zero data loss rather than continuous availability.
-steve