hbase-dev mailing list archives

From Nick Dimiduk <ndimi...@gmail.com>
Subject Re: Backup HMasters will go down if the zk connection expires without recovery
Date Fri, 21 Mar 2014 16:36:51 GMT
I agree that resuming the process is best handled by site-local tooling.
Perhaps we could do a better job of informing that tooling about the nature
of the failure. Well-defined exit codes, for instance, may be useful.
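
A minimal sketch of what well-defined exit codes might look like. The class
name, the failure categories, and the numeric values are all hypothetical;
none of this is existing HBase behavior. The idea is only that each shutdown
cause maps to a distinct code that a supervisor can act on:

```java
// Hypothetical sketch: not part of HBase. Names and code values are assumptions.
final class MasterExitCodes {

    /** Illustrative shutdown causes a master process might report. */
    enum Cause { CLEAN_STOP, ZK_SESSION_EXPIRED, CONFIG_ERROR, UNKNOWN }

    /** Maps a shutdown cause to a process exit code (values chosen arbitrarily). */
    static int exitCodeFor(Cause cause) {
        switch (cause) {
            case CLEAN_STOP:         return 0;
            case ZK_SESSION_EXPIRED: return 3;
            case CONFIG_ERROR:       return 4;
            default:                 return 1;
        }
    }

    /**
     * Example supervisor-side policy: restart only after a ZK session expiry.
     * A configuration error would fail again on restart, so it is left alone.
     */
    static boolean shouldRestart(int exitCode) {
        return exitCode == exitCodeFor(Cause.ZK_SESSION_EXPIRED);
    }

    public static void main(String[] args) {
        for (Cause cause : Cause.values()) {
            int code = exitCodeFor(cause);
            System.out.println(cause + " -> exit " + code
                + ", restart=" + shouldRestart(code));
        }
    }
}
```

A process would call System.exit(MasterExitCodes.exitCodeFor(cause)) on its
way down, and the site-local supervisor would decide what to do based on the
code it observes.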

On Thursday, March 20, 2014, Du, Jingcheng <jingcheng.du@intel.com> wrote:

> Thanks a lot for the comments.
>
> I think we could have another service or supervisor to bring the backup
> masters back when they go down.
>
> Regards,
> Jingcheng
>
> -----Original Message-----
> From: ramkrishna vasudevan [mailto:ramkrishna.s.vasudevan@gmail.com]
> Sent: Friday, March 21, 2014 12:02 PM
> To: dev@hbase.apache.org
> Subject: Re: Backup HMasters will go down if the zk connection expires
> without recovery
>
> We discussed this internally too. Maybe the intention was to see whether
> it can be handled through code. Generally the management of these backup
> masters can be done outside of HBase through monitoring services.
> @Jingcheng
> What do you think?
>
> Regards
> Ram
>
>
> On Fri, Mar 21, 2014 at 3:25 AM, Enis Söztutar <enis.soz@gmail.com>
> wrote:
>
> > Zk session recovery in the active master was added some time ago, but
> > it requires complex state management with regard to which services
> > inside the master to reinitialize or keep. We discussed that we should
> > remove it altogether, since it increases the code complexity by a
> > lot and makes recovery from zk session loss very error prone (I
> > remember 1-2 issues fixing this area).
> >
> > I think architecturally, we should remove zk session recovery from the
> > active master, and not add it to backup masters at all. Another service,
> > like Ambari, or a supervisor should be responsible for bringing the
> > master / backup master nodes back.
> >
> > Enis
> >
> >
> > On Thu, Mar 20, 2014 at 11:35 AM, Andrew Purtell <apurtell@apache.org>
> > wrote:
> >
> > > Why did the backup master's zookeeper session expire? That indicates
> > > a problem somewhere on the network or with zookeeper.
> > >
> > > The active master and regionservers also shut down when their
> > > sessions expire. If our zookeeper session expires we have been
> > > partitioned and have a high degree of uncertainty from our vantage
> > > point on the state of the world. We shut down to avoid accidentally
> > > taking incorrect actions with bad or out of date state. This
> > > simplifies design and removes corner cases. In a production
> > > environment I would expect a site local strategy (could be
> > > daemontools etc.) for automatic service recovery, if that is desired.
> > >
> > >
> > >
> > > On Thu, Mar 20, 2014 at 12:43 AM, Du, Jingcheng
> > > <jingcheng.du@intel.com> wrote:
> > >
> > > > Dear Devs,
> > > >
> > > >   Now I encounter a problem in the HMaster.
> > > >   Currently I run multiple HMasters in a cluster. If the ZK
> > > > connection of one of the backup HMasters expires, this backup
> > > > HMaster will go down directly without recovering the ZK connection.
> > > > I saw the code listed below in HMaster.abortNow(); the fail.fast
> > > > setting only works for the active HMaster. Do the backup ones
> > > > need to be recovered if the zk connection expires? Please advise.
> > > > Thanks.
> > > >
> > > > if (!this.isActiveMaster || this.stopped) {
> > > >   return true;
> > > > }
> > > > boolean failFast =
> > > >     conf.getBoolean("fail.fast.expired.active.master", false);
> > > >
> > > >
> > > > Regards,
> > > > Jingcheng
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > >    - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet
> > > Hein (via Tom White)
> > >
> >
>
