hadoop-common-user mailing list archives

From "M. C. Srivas" <mcsri...@gmail.com>
Subject Re: Is this a fair summary of HDFS failover?
Date Mon, 14 Feb 2011 17:50:58 GMT
The summary is quite inaccurate.

On Mon, Feb 14, 2011 at 8:48 AM, Mark Kerzner <markkerzner@gmail.com> wrote:

> Hi,
> is it accurate to say that
>   - In 0.20 the Secondary NameNode acts as a cold spare; it can be used to
>   recreate the HDFS if the Primary NameNode fails, but with the delay of
>   minutes if not hours, and there is also some data loss;

The Secondary NN is not a spare. It exists to offload some of the Primary's
work, namely "log rollup" or "checkpointing", to another machine. This has
been a source of constant confusion (someone incorrectly named it the
"secondary" and now we are stuck with that name).

The Secondary NN certainly cannot take over for the Primary. That is not its
job.

Yes, there is data loss.
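
For reference, the Secondary NN's checkpoint behaviour in 0.20 is driven by a
couple of settings. A minimal sketch (property names as they appear in
0.20-era configs; the directory path is a placeholder):

  <!-- on the machine running the Secondary NN -->
  <property>
    <name>fs.checkpoint.period</name>
    <value>3600</value>                      <!-- seconds between checkpoints -->
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/data/dfs/namesecondary</value>   <!-- placeholder path -->
  </property>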

>   - in 0.21 there are streaming edits to a Backup Node (HADOOP-4539), which
>   replaces the Secondary NameNode. The Backup Node can be used as a warm
>   spare, with the failover being a matter of seconds. There can be multiple
>   Backup Nodes, for additional insurance against failure, and previous best
>   common practices apply to it;

There is no "Backup NN" in the manner you are thinking of. It is completely
manual, and requires restart of the "whole world", and takes about 2-3 hours
to happen. If you are lucky, you may have only a little data loss (people
have lost entire clusters due to this -- from what I understand, you are far
better off resurrecting the Primary instead of trying to bring up a Backup

In any case, when you run it like you mention above, you will have to
(a) make sure that the primary is dead
(b) edit hdfs-site.xml on *every* datanode to point to the new IP address of
the backup, and restart each datanode (see the config sketch after these steps)
(c) wait 2-3 hours for all the block-reports from every restarted DN to come in

2-3 hrs afterwards:
(d) after that, restart all TT and the JT to connect to the new NN
(e) finally, restart all the clients (eg, HBase, Oozie, etc)
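
To be concrete about step (b): the datanodes locate the NN through the
filesystem URI in their config. A minimal sketch of the property involved
(fs.default.name; depending on the setup it lives in core-site.xml or
hdfs-site.xml, and the host/port below is a placeholder):

  <!-- on every datanode, point the default filesystem at the new NN -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://new-nn-host:8020</value>   <!-- placeholder address -->
  </property>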

Many companies, including Yahoo! and Facebook, use a couple of NetApp filers
to hold the actual data that the NN writes. The two NetApp filers are run in
"HA" mode with NVRAM copying.  But the NN remains a single point of failure,
and there is probably some data loss.
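
The usual way to get the NN metadata onto the filers is to list an NFS mount
from the filer alongside a local directory in dfs.name.dir, so the NN writes
its image and edit log to both. A minimal sketch (paths are placeholders):

  <!-- hdfs-site.xml on the NameNode -->
  <property>
    <name>dfs.name.dir</name>
    <value>/data/dfs/name,/mnt/filer/dfs/name</value>
  </property>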

>   - 0.22 will have further improvements to the HDFS performance, such
>   as HDFS-1093.
> Does the paper on HDFS Reliability by Tom White
> <http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf>
> still represent the current state of things?

See Dhruba's blog-post about the Avatar NN + some custom "stackable HDFS"
code on all the clients + Zookeeper + the dual NetApp filers.

It helps Facebook do manual, controlled, fail-over during software upgrades,
at the cost of some performance loss on the DataNodes (the DataNodes have to
do 2x block reports, and each block-report is expensive, so it limits the
DataNode a bit).  The article does not talk about data loss when the
fail-over is initiated manually, so I don't know about that.


> Thank you. Sincerely,
> Mark
