hbase-dev mailing list archives

From ramkrishna vasudevan <ramkrishna.s.vasude...@gmail.com>
Subject Re: Backup HMasters will go down if the zk connection expires without recovery
Date Fri, 21 Mar 2014 04:01:45 GMT
We discussed this internally too.  Maybe the intention was to see whether
it can be handled in code.  Generally, the management of these backup
masters can be done outside of HBase through monitoring services.
@Jingcheng
What do you think?

Regards
Ram


On Fri, Mar 21, 2014 at 3:25 AM, Enis Söztutar <enis.soz@gmail.com> wrote:

> Zk session recovery in the active master was added some time ago, but it
> requires complex state management with regard to which services inside the
> master to reinitialize or keep. We discussed that we should remove it
> altogether, since it increases the code complexity by a lot and makes
> recovery from a lost zk session very error prone (I remember 1-2 issues
> fixing this area).
>
> I think architecturally, we should remove zk session recovery from the
> active master, and not add it to backup masters at all. Another service,
> like Ambari or a supervisor, should be responsible for bringing the
> master / backup master nodes back.
>
> Enis
>
>
> On Thu, Mar 20, 2014 at 11:35 AM, Andrew Purtell <apurtell@apache.org> wrote:
>
> > Why did the backup master's zookeeper session expire? That indicates a
> > problem somewhere on the network or with zookeeper.
> >
> > The active master and regionservers also shut down when their sessions
> > expire. If our zookeeper session expires we have been partitioned and have
> > a high degree of uncertainty from our vantage point on the state of the
> > world. We shut down to avoid accidentally taking incorrect actions with bad
> > or out of date state. This simplifies design and removes corner cases. In
> > a production environment I would expect a site local strategy (could be
> > daemontools etc.) for automatic service recovery, if that is desired.
> >
> >
> >
> > On Thu, Mar 20, 2014 at 12:43 AM, Du, Jingcheng <jingcheng.du@intel.com> wrote:
> >
> > > Dear Devs,
> > >
> > >   I have encountered a problem in the HMaster.
> > >   Currently I run multiple HMasters in a cluster. If the ZK connection of
> > > one of the backup HMasters expires, this backup HMaster goes down
> > > directly without recovering the ZK connection.
> > > I saw the code listed below in HMaster.abortNow(); the fail.fast check
> > > only applies to the active HMaster. Do the backup ones need to be
> > > recovered if the zk connection expires? Please advise. Thanks.
> > >
> > > if (!this.isActiveMaster || this.stopped) {
> > >   return true;
> > > }
> > > boolean failFast = conf.getBoolean("fail.fast.expired.active.master", false);
> > >
> > >
> > > Regards,
> > > Jingcheng
> > >
> >
> >
> >
> > --
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
>
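
The policy argued for in this thread (stop on session expiry and let an external supervisor bring the process back) can be sketched roughly as below. This is only an illustration under stated assumptions, not HBase or ZooKeeper code: SessionState, MasterSketch, and SupervisorSketch are hypothetical names, and a real deployment would use an external process supervisor (daemontools, Ambari, etc.) rather than an in-process loop.

```java
import java.util.List;

/** Hypothetical session states; the names echo ZooKeeper's but this is not its API. */
enum SessionState { CONNECTED, DISCONNECTED, EXPIRED }

/** Hypothetical stand-in for a master process, active or backup. */
final class MasterSketch {
    private boolean aborted = false;

    /** Feeds one session event to the master; returns true once the process should exit. */
    boolean onSessionEvent(SessionState state) {
        // A transient disconnect is tolerated: the client may still reconnect
        // within the session timeout. An expired session cannot be resumed, and
        // the process may have been partitioned, so it stops instead of acting
        // on possibly stale state. Once aborted, it stays aborted.
        if (state == SessionState.EXPIRED) {
            aborted = true;
        }
        return aborted;
    }
}

/** Hypothetical supervisor: replaces an aborted master with a fresh instance. */
final class SupervisorSketch {
    /** Replays events and returns how many restarts were needed. */
    static int countRestarts(List<SessionState> events) {
        int restarts = 0;
        MasterSketch master = new MasterSketch();
        for (SessionState event : events) {
            if (master.onSessionEvent(event)) {
                // Restart with clean state instead of repairing the old instance.
                master = new MasterSketch();
                restarts++;
            }
        }
        return restarts;
    }
}
```

The sketch applies the same rule to active and backup masters and keeps no recovery logic inside the master itself, which is the simplification Andrew and Enis describe; the cost is that something outside the process has to notice the exit and restart it.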
