From Michel Segel <michael_se...@hotmail.com>
Subject Re: hadoop namenode recovery
Date Thu, 17 Jan 2013 10:26:52 GMT
MapR was the first vendor to remove the NN as a SPOF.
They did this with their 1.0 release when it first came out. The downside is that their release
is proprietary and very different in terms of the underlying architecture from Apache-based
releases.

Hortonworks relies on VMware as a key piece of their release.

If you want HA, you are going to have to look at a specific vendor's approach.

On Jan 17, 2013, at 3:53 AM, randy <randysch@comcast.net> wrote:

> I've seen NFS get in a state many times where the mount is still there, but it can't
> be written to or accessed. What happens in that case? If the network is congested or slow,
> does that slow down the overall NN performance?
> Thanks,
> randy
> On 01/15/2013 11:14 PM, Harsh J wrote:
>> The NFS mount is to be soft-mounted, so if the NFS goes down, the NN
>> ejects it and continues with the local disk. If auto-restore is
>> configured, it will re-add the NFS mount if it's detected as good again later.
>> On Wed, Jan 16, 2013 at 7:04 AM, randy <randysch@comcast.net> wrote:
>>    What happens to the NN and/or performance if there's a problem with
>>    the NFS server? Or the network?
>>    Thanks,
>>    randy
>>    On 01/14/2013 11:36 PM, Harsh J wrote:
>>        It's very rare to observe an NN crash due to a software bug in
>>        production. Most of the time it's a hardware fault you should
>>        worry about.
>>        On 1.x, or any non-HA-carrying release, the best you can get to
>>        safeguard against a total loss is to have redundant disk volumes
>>        configured, one preferably over a dedicated remote NFS mount.
>>        This way the NN is recoverable after the node goes down, since you
>>        can retrieve a current copy from another machine (i.e. via the NFS
>>        mount) and set a new node up to replace the older NN and continue
>>        along.
>>        A load balancer will not work as the NN is not a simple
>>        webserver - it maintains state which you cannot sync. We wrote
>>        HA-HDFS features to address the very concern you have.
>>        If you want true, painless HA, branch-2 is your best bet at this
>>        point. An upcoming 2.0.3 release should include the QJM-based HA
>>        feature, which is painless to set up, very reliable to use (over
>>        other options), and works with commodity-level hardware. FWIW,
>>        we've (my team and I) been supporting several users and customers
>>        who're running the 2.x-based HA in production and other types of
>>        environments, and it has been very stable in our experience. There
>>        are also some folks in the community running 2.x-based HDFS for HA
>>        and other purposes.
>>        On Tue, Jan 15, 2013 at 6:55 AM, Panshul Whisper
>>        <ouchwhisper@gmail.com> wrote:
>>             Hello,
>>             Is there a standard way to prevent a Namenode crash in a
>>             Hadoop cluster? Or what is the standard or best practice for
>>             overcoming Hadoop's single point of failure problem?
>>             I am not ready to take chances on a production server with
>>             the Hadoop 2.0 Alpha release, which claims to have solved the
>>             problem. Are there any other things I can do to either prevent
>>             the failure or recover from it in a very short time?
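
For anyone staying on 1.x, the redundant-volume setup Harsh describes in the quoted
thread might look roughly like the hdfs-site.xml fragment below. This is only a sketch:
both directory paths are hypothetical, the NFS export is assumed to be soft-mounted
already, and dfs.name.dir.restore is the auto-restore switch as I understand it. It may
not be present in every 1.x release, so check your version's hdfs-default.xml first.

```xml
<!-- hdfs-site.xml sketch using Hadoop 1.x property names; both paths are hypothetical -->
<configuration>
  <property>
    <!-- Comma-separated list: the NN writes its image and edit log to every directory listed -->
    <name>dfs.name.dir</name>
    <!-- First entry is a local disk, second is a directory on the soft-mounted NFS export -->
    <value>/data/1/dfs/nn,/mnt/nfs/dfs/nn</value>
  </property>
  <property>
    <!-- Assumption: lets the NN try to re-add a directory it ejected earlier; verify for your release -->
    <name>dfs.name.dir.restore</name>
    <value>true</value>
  </property>
</configuration>
```

On 2.x the corresponding property names are dfs.namenode.name.dir and
dfs.namenode.name.dir.restore.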
