hbase-dev mailing list archives

From Enis Söztutar <enis....@gmail.com>
Subject Re: Backup HMasters will go down if the zk connection expires without recovery
Date Thu, 20 Mar 2014 21:55:55 GMT
Zk session recovery in the active master was added some time ago, but it
requires complex state management to decide which services inside the
master to reinitialize and which to keep. We discussed removing it
altogether, since it adds a lot of code complexity and makes recovery
from a lost zk session very error-prone (I remember 1-2 issues fixing
this area).

I think that, architecturally, we should remove zk session recovery from
the active master and not add it to backup masters at all. Another service,
like Ambari or a supervisor, should be responsible for bringing the master /
backup master nodes back.
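
A minimal sketch of the "fail fast, restart from outside" idea, for
illustration only. Everything in it is hypothetical: SupervisorSketch,
supervise(), and the EXIT_ABORTED code are made-up names, and the simulated
service stands in for a master process that aborts when its zk session
expires. It is not HBase code and not how Ambari or daemontools are
implemented.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch: the service fails fast, an outside loop restarts it. */
class SupervisorSketch {

    /** Assumed exit code for a master that aborted (e.g. zk session expired). */
    static final int EXIT_ABORTED = 1;

    /**
     * Runs the service until it exits cleanly (code 0) or the restart
     * budget is used up. Returns the number of restarts performed.
     */
    static int supervise(IntSupplier service, int maxRestarts) {
        int restarts = 0;
        while (true) {
            int exitCode = service.getAsInt();
            if (exitCode == 0 || restarts >= maxRestarts) {
                return restarts;
            }
            // The service keeps no recovery state of its own: it simply
            // starts again from scratch on the next iteration.
            restarts++;
        }
    }

    public static void main(String[] args) {
        // Simulated master: the first two runs abort, the third exits cleanly.
        int[] runs = {0};
        int restarts = supervise(() -> ++runs[0] < 3 ? EXIT_ABORTED : 0, 5);
        System.out.println("restarts=" + restarts + ", runs=" + runs[0]);
    }
}
```

The point of the design is that the master needs no in-process logic for
deciding which services to reinitialize; each restart begins with a fresh
zk session and fresh state.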

Enis


On Thu, Mar 20, 2014 at 11:35 AM, Andrew Purtell <apurtell@apache.org> wrote:

> Why did the backup master's zookeeper session expire? That indicates a
> problem somewhere on the network or with zookeeper.
>
> The active master and regionservers also shut down when their sessions
> expire. If our zookeeper session expires we have been partitioned and have
> a high degree of uncertainty from our vantage point on the state of the
> world. We shut down to avoid accidentally taking incorrect actions with bad
> or out-of-date state. This simplifies design and removes corner cases. In
> a production environment I would expect a site-local strategy (could be
> daemontools etc.) for automatic service recovery, if that is desired.
>
>
>
> On Thu, Mar 20, 2014 at 12:43 AM, Du, Jingcheng <jingcheng.du@intel.com> wrote:
>
> > Dear Devs,
> >
> >   I have run into a problem in the HMaster.
> >   I currently run multiple HMasters in a cluster. If the ZK connection of
> > one of the backup HMasters expires, that backup HMaster goes down
> > immediately without recovering the ZK connection.
> > I saw the code below in HMaster.abortNow(); the fail.fast setting only
> > applies to the active HMaster. Should the backup ones be recovered if
> > the zk connection expires? Please advise. Thanks.
> >
> > if (!this.isActiveMaster || this.stopped) {
> >   return true;
> > }
> > boolean failFast = conf.getBoolean("fail.fast.expired.active.master", false);
> >
> >
> > Regards,
> > Jingcheng
> >
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
