hadoop-common-user mailing list archives

From Stas Oskin <stas.os...@gmail.com>
Subject Re: NameNode high availability
Date Sat, 03 Oct 2009 15:07:30 GMT
Hi Edward.
Thanks for the follow-up, and for the time you put into the private chat. I
must say that while Xen migrate is cool, I'm really only looking for HA.

So far, the above-mentioned Xen/DRBD setup hasn't worked reliably for me, and
I've already burnt at least 3 days on it.

There are too many components in this mix, and it appears that many of them
still have bugs.

Therefore, I'm seriously thinking of abandoning this approach and following
the above-mentioned guide instead, as I already have the infrastructure
(Heartbeat and DRBD) in place.

My only concern so far was that I really wanted to keep some kind of logical
separation (i.e. DataNode and NameNode), with the plan to seamlessly migrate
them to separate machines in the future. But it seems it really isn't worth
the hassle.

So my new plan:

1) Drop Xen completely, to keep the number of possible failure layers to a
minimum.

2) Keep DRBD for the NN and SNN metadata, possibly mixing it with MySQL data.
This means I can get 2 HA solutions for the price of one - any comments on
whether this mixup is healthy?

3) Create NFS exports on each server for SNN backup - in the unlikely event
that DRBD breaks on every server at once.
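For point 3, the idea would be to point the metadata and checkpoint dirs at
both the local (DRBD-backed) path and the NFS mount, so the NameNode writes
each image twice. A rough hdfs-site.xml sketch - the paths are placeholders
for my mounts:

```xml
<!-- hdfs-site.xml sketch: local DRBD-backed dir plus an NFS mount
     exported by the peer server; Hadoop writes to all listed dirs -->
<property>
  <name>dfs.name.dir</name>
  <value>/hadoop/dfs/name,/mnt/nfs/peer/dfs/name</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/hadoop/dfs/namesecondary,/mnt/nfs/peer/dfs/namesecondary</value>
</property>
```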

An additional point I forgot to mention: in my case DRBD runs over software
RAID 1, which means I have server-level HA as well.
So far this setup has worked just fine.
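To illustrate the layering, here is roughly what the DRBD resource looks like
on top of the RAID 1 device (a sketch - resource name, hostnames, addresses
and devices are placeholders):

```
# drbd.conf sketch: DRBD replicates /dev/md0 (software RAID1) between nodes
resource r0 {
  protocol C;                     # synchronous replication
  on nn1 {
    device    /dev/drbd0;
    disk      /dev/md0;           # software RAID 1 underneath
    address   192.168.1.10:7788;
    meta-disk internal;
  }
  on nn2 {
    device    /dev/drbd0;
    disk      /dev/md0;
    address   192.168.1.11:7788;
    meta-disk internal;
  }
}
```

So a disk failure is absorbed by RAID 1 locally, and a whole-server failure is
absorbed by DRBD plus Heartbeat.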

Any comments from the list regarding the above?

Edward, as we discussed, I'm also going to check out VLinux for limiting the
processes in case they all run on the same machine.

Thanks in advance!


2009/10/3 Edward Capriolo <edlinuxguru@gmail.com>

> On Sat, Oct 3, 2009 at 1:07 AM, Otis Gospodnetic
> <otis_gospodnetic@yahoo.com> wrote:
> > Related (but not helping the immediate question).  China Telecom
> developed something they call HyperDFS.  They modified Hadoop and made it
> possible to run a cluster of NNs, thus eliminating the SPOF.
> >
> > I don't have the details - the presenter at Hadoop World (last round of
> sessions, 2nd floor) mentioned that.  Didn't give a clear answer when asked
> about contributing it back.
> >
> >  Otis
> > --
> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >
> >
> >
> > ----- Original Message ----
> >> From: Steve Loughran <stevel@apache.org>
> >> To: common-user@hadoop.apache.org
> >> Sent: Friday, October 2, 2009 7:22:45 AM
> >> Subject: Re: NameNode high availability
> >>
> >> Stas Oskin wrote:
> >> > Hi.
> >> >
> >> > The HA service (heartbeat) is running on Dom0, and when the primary
> >> > node is down, it basically just starts the VM on the other node. So
> >> > there not supposed to be any time issues.
> >> >
> >> > Can you explain a bit more about your approach, how to automate it for
> >> example?
> >>
> >> * You need to have something - a "resource manager" - keeping an eye on
> >> the NN from somewhere. Needless to say, that needs to be fairly HA too.
> >>
> >> * your NN image has to be ready to go
> >>
> >> * when the deployed NN goes away, bring up a new machine with the same
> >> image, hostname *and IP address*. You can't always pull the latter off;
> >> it depends on the infrastructure. Without that, you'd need to bring up
> >> all the nodes with DNS caching set to a short TTL and update a DNS
> >> entry.
> >>
> >> This isn't real HA, it's recovery.
> >
> >
>
> Stas,
>
> I think your setup does work, but there are some inherent complexities
> that are not accounted for. For instance, NameNode metadata is not
> written to disk in a transactional fashion. Thus, even though you have
> block-level replication, you cannot be sure that the underlying
> NameNode data is in a consistent state. (This should not be an issue
> with live-migrate, though.)
>
> I have worked with Linux-HA, DRBD, OCFS2 and many other HA
> technologies, so let me summarize my experiences. The idea is that a
> normal standalone system has components that fail, but the mean time to
> failure is high - say 99.9% availability; disk components fail most
> often, and with solid state or RAID 1 you might get 99.99%. If you look
> really hard at what the NameNode does, a massive Hadoop file system may
> be only a few hundred MB to a few GB of NameNode data. Assuming you
> have a hot spare, restoring a few GB of data would not take very long.
> (Large volumes of data are tricky because they take longer to restore.)
>
> If Xen LiveMigrate works, it is a slam dunk and very cool! You might
> not even miss a ping. But let's say you're NOT doing a live migration:
> down the Xen instance and bring it up on the other node. The failover
> might take 15 seconds, but it may work more reliably.
>
> From experience, I will tell you that on a different project I went way
> cutting-edge with DRBD, Linux-HA and OCFS2 to mount the same file
> system simultaneously on two nodes. It worked OK, but I sank weeks of
> research into it and it was very complex: OCFS2 conf files, HA conf
> files, DRBD conf files, kernel modules - and no matter how many docs I
> wrote, no one could follow it but me.
>
> My lesson learned was that I really did not need that sub-second
> failover, nor did I need to be able to dual-mount the drive. With those
> things stripped out, I had better results. So you might want to
> consider: do you need live-migrate? Do you need Xen?
>
> This doc tries to use fewer moving parts:
> http://www.cloudera.com/blog/2009/07/22/hadoop-ha-configuration/
> ContextWeb presented at Hadoop World NYC and mentioned that with this
> setup they had 6 failures (3 planned, 3 unplanned), and it worked each
> time.
>
> FYI - I did imply earlier that the DRBD/Xen approach should work.
> Sorry for being misleading. I find people have mixed results with some
> of these HA tools. I have them working in some cases, and then in other
> edge cases they do not. Your mileage may vary.
>
