hadoop-common-user mailing list archives

From Mark Kerzner <markkerz...@gmail.com>
Subject Re: Is this a fair summary of HDFS failover?
Date Mon, 14 Feb 2011 22:52:43 GMT
I completely agree. I am using your postings and the group's to define the
direction and approaches, but I am also trying out every solution, and I am
starting to do just that with the AvatarNode now.

Thank you,
Mark

On Mon, Feb 14, 2011 at 4:43 PM, M. C. Srivas <mcsrivas@gmail.com> wrote:

> I understand you are writing a book, "Hadoop in Practice". If so, it's
> important that what's recommended in the book be verified in
> practice. (I mean, beyond simply posting in this newsgroup - for instance,
> the recommendations on NN fail-over should be tried out before writing
> about how to do it.) Otherwise you won't know whether your recommendations
> really work.
>
>
>
> On Mon, Feb 14, 2011 at 12:31 PM, Mark Kerzner <markkerzner@gmail.com> wrote:
>
> > Thank you, M. C. Srivas, that was enormously useful. I understand it now,
> > but just to be complete, I have re-formulated my points according to your
> > comments:
> >
> >   - In 0.20 the Secondary NameNode performs snapshotting. Its data can be
> >   used to recreate the HDFS if the Primary NameNode fails. The procedure is
> >   manual and may take hours, and any data written since the last snapshot
> >   is lost;
> >   - In 0.21 there is a Backup Node (HADOOP-4539), which aims to help with
> >   HA and act as a cold spare. The data loss is less than with the Secondary
> >   NN, but failover is still manual and potentially error-prone, and it
> >   takes hours;
> >   - There is an AvatarNode patch available for 0.20, and Facebook runs its
> >   cluster that way, but the patch submitted to Apache requires testing, and
> >   the developers adopting it must do some custom configuration and also
> >   exercise care in their work.
> >
> > In conclusion, when building an HA HDFS cluster, one needs to follow the
> > best practices outlined by Tom White
> > <http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf>,
> > and may still need to resort to specialized NFS filers for running the
> > NameNode.
> >
> > Sincerely,
> > Mark
> >
> >
> >
> > On Mon, Feb 14, 2011 at 11:50 AM, M. C. Srivas <mcsrivas@gmail.com> wrote:
> >
> > > The summary is quite inaccurate.
> > >
> > > On Mon, Feb 14, 2011 at 8:48 AM, Mark Kerzner <markkerzner@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > is it accurate to say that
> > > >
> > > >   - In 0.20 the Secondary NameNode acts as a cold spare; it can be used to
> > > >   recreate the HDFS if the Primary NameNode fails, but with a delay of
> > > >   minutes if not hours, and there is also some data loss;
> > > >
> > >
> > >
> > > The Secondary NN is not a spare. It is used to augment the work of the
> > > Primary, by offloading some of its work to another machine. The work
> > > offloaded is "log rollup" or "checkpointing". This has been a source of
> > > constant confusion (some named it incorrectly as a "secondary" and now we
> > > are stuck with it).
> > >
> > > The Secondary NN certainly cannot take over for the Primary. That is not
> > > its purpose.
> > >
> > > Yes, there is data loss.
> > >
> > >
> > >
> > >
> > > >   - In 0.21 there are streaming edits to a Backup Node (HADOOP-4539), which
> > > >   replaces the Secondary NameNode. The Backup Node can be used as a warm
> > > >   spare, with the failover being a matter of seconds. There can be multiple
> > > >   Backup Nodes, for additional insurance against failure, and previous best
> > > >   common practices apply to it;
> > > >
> > >
> > >
> > > There is no "Backup NN" in the manner you are thinking of. It is completely
> > > manual, and requires restart of the "whole world", and takes about 2-3 hours
> > > to happen. If you are lucky, you may have only a little data loss (people
> > > have lost entire clusters due to this -- from what I understand, you are far
> > > better off resurrecting the Primary instead of trying to bring up a Backup
> > > NN).
> > >
> > > In any case, when you run it like you mention above, you will have to
> > > (a) make sure that the primary is dead
> > > (b) edit hdfs-site.xml on *every* datanode to point to the new IP address
> > > of the backup, and restart each datanode
> > > (c) wait 2-3 hours for the block reports from every restarted DN to
> > > finish
> > >
> > > After those 2-3 hours:
> > > (d) restart all the TTs and the JT to connect to the new NN
> > > (e) finally, restart all the clients (e.g., HBase, Oozie, etc.)
> > >
> > > Many companies, including Yahoo! and Facebook, use a couple of NetApp filers
> > > to hold the actual data that the NN writes. The two NetApp filers are run in
> > > "HA" mode with NVRAM copying. But the NN remains a single point of failure,
> > > and there is probably some data loss.
> > >
> > >
> > >
> > > >   - 0.22 will have further improvements to the HDFS performance, such
> > > >   as HDFS-1093.
> > > >
> > > > Does the paper on HDFS Reliability by Tom White
> > > > <http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf>
> > > > still represent the current state of things?
> > > >
> > >
> > >
> > > See Dhruba's blog-post about the Avatar NN + some custom "stackable HDFS"
> > > code on all the clients + Zookeeper + the dual NetApp filers.
> > >
> > > It helps Facebook do manual, controlled fail-over during software upgrades,
> > > at the cost of some performance loss on the DataNodes (the DataNodes have to
> > > do 2x block reports, and each block report is expensive, so it limits the
> > > DataNode a bit). The article does not talk about data loss when the
> > > fail-over is initiated manually, so I don't know about that.
> > >
> > >
> > >
> > > http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html
> > >
> > >
> > >
> > >
> > > >
> > > > Thank you. Sincerely,
> > > > Mark
> > > >
> > >
> >
>
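
For anyone trying step (b) of the quoted procedure by hand (repointing each
DataNode's configuration at the new NameNode), the edit itself can be
scripted. The following is only a minimal sketch in Python, not a tested
fail-over tool. Assumptions: the helper name set_property is made up for
illustration; the property name fs.default.name, the addresses 10.0.0.1 and
10.0.0.2, and port 8020 are placeholders, and which file and property carry
the NameNode address on a given cluster has to be checked against that
cluster's own configuration. The sketch only rewrites the XML text; it does
not distribute files or restart any daemon.

```python
import xml.etree.ElementTree as ET


def set_property(config_xml: str, name: str, value: str) -> str:
    """Return a copy of a Hadoop-style <configuration> document in which
    the property called `name` has the given `value`.

    If the property is missing, it is appended. (Hypothetical helper,
    written for illustration only.)
    """
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                # Property exists but has no <value> child yet.
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No matching property found: add a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder configuration pointing at the failed primary (assumed values).
OLD = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.0.0.1:8020</value>
  </property>
</configuration>"""

# Same configuration, repointed at the replacement NameNode (assumed address).
NEW = set_property(OLD, "fs.default.name", "hdfs://10.0.0.2:8020")
print(NEW)
```

The rewritten file would still have to be pushed to every DataNode, and each
DataNode restarted, before steps (c) through (e) can proceed.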
