hbase-user mailing list archives

From Jonathan Gray <jg...@facebook.com>
Subject RE: Sanity date time check when a region server joins the cluster
Date Thu, 28 Oct 2010 18:08:13 GMT
I was discussing this exact issue this morning.  Ran into a problem where master was timing
out a region in transition because the RS was 5 minutes behind the master.

I like the idea of the RS sending its timestamp on startup; if it is outside a certain
threshold, the master throws it a ClockOutOfSync-like exception and the RS goes down.
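
A minimal sketch of that check, to make the idea concrete. It is not existing HBase code: the class name ClockSkewCheck, the method checkClockSkew, the nested ClockOutOfSyncException, and the 30-second threshold are all hypothetical names and values chosen for illustration.

```java
import java.io.IOException;

class ClockSkewCheck {
    // Hypothetical exception type, named after the "ClockOutOfSync-like" idea.
    static class ClockOutOfSyncException extends IOException {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    // Assumed threshold for illustration only; a real one would likely be configurable.
    static final long MAX_SKEW_MS = 30_000L;

    // Throws if the time reported by the region server differs from the
    // master's time by more than maxSkewMs, in either direction.
    static void checkClockSkew(String serverName, long reportedTimeMs,
                               long masterTimeMs, long maxSkewMs)
            throws ClockOutOfSyncException {
        long skewMs = Math.abs(masterTimeMs - reportedTimeMs);
        if (skewMs > maxSkewMs) {
            throw new ClockOutOfSyncException("Server " + serverName
                + " rejected: reported time differs from master by "
                + skewMs + "ms (max allowed " + maxSkewMs + "ms)");
        }
    }

    public static void main(String[] args) {
        long masterNow = System.currentTimeMillis();
        long fiveMinutesBehind = masterNow - 5L * 60 * 1000;
        try {
            // Simulates a region server whose clock is 5 minutes behind the master.
            checkClockSkew("rs-example", fiveMinutesBehind, masterNow, MAX_SKEW_MS);
            System.out.println("Server accepted");
        } catch (ClockOutOfSyncException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The check uses the absolute difference so that a server running ahead of the master is rejected as well as one running behind.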

Please do file a jira, Jeff.  Or let me know and I can do it.

JG

> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
> Sent: Thursday, October 28, 2010 10:00 AM
> To: user@hbase.apache.org
> Subject: Re: Sanity date time check when a region server joins the cluster
> 
> That could be done easily when the server checks in by looking at the
> given start code. In ServerManager we already do:
> 
>     HServerInfo info = new HServerInfo(serverInfo);
>     checkIsDead(info.getServerName(), "STARTUP");
>     checkAlreadySameHostPort(info);
>     recordNewServer(info, false, null);
> 
> A new check in there would fit nicely. Can you open a jira Jeff?
> 
> Thx!
> 
> J-D
> 
> On Thu, Oct 28, 2010 at 9:56 AM, Jeff Whiting <jeffw@qualtrics.com> wrote:
> > We recently had a problem where one of our machines in the cluster had a
> > time that was 6 hours behind the other ones (NTP was supposed to be set up
> > on that machine but wasn't).  We subsequently restarted our cluster and the
> > '-ROOT-' table was assigned to that machine.  The problem was that when it
> > tried to update the value (info:server) for who was holding the '.META.'
> > table, the value wasn't updating and stayed set as the previous machine.
> > I'm pretty sure the problem was that the timestamp for the new server was
> > older than the timestamp for the previous server, preventing the value from
> > updating correctly.  Having the incorrect info:server in the ROOT table
> > basically made the cluster unusable.
> >
> > So my question is, would it make sense to have a sanity time check when a
> > region server joins the cluster?  Basically, when the region server joins it
> > would send its current time and the master would check that time against its
> > own current time; if the difference is too large, it would prevent the
> > region server from joining.  I know this is basic server configuration stuff,
> > but because of human error these things happen, and they seem like they can
> > cause major problems for the cluster if the servers' times aren't
> > synchronized.
> >
> > ~Jeff
> >
> > --
> >
> > Jeff Whiting
> > Qualtrics Senior Software Engineer
> > jeffw@qualtrics.com
> >
> >
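
The stale info:server value described in the quoted message is consistent with how versioned cells behave: a read returns the version with the highest timestamp, so a write stamped with an older time is stored but shadowed. The sketch below is a simplified in-memory model of that behavior, not HBase code; the class VersionedCell and the server names are hypothetical.

```java
import java.util.Comparator;
import java.util.TreeMap;

class VersionedCell {
    // Versions keyed by timestamp, sorted newest first.
    private final TreeMap<Long, String> versions =
        new TreeMap<>(Comparator.reverseOrder());

    // Stores a value under the timestamp supplied by the writer's clock.
    void put(long timestampMs, String value) {
        versions.put(timestampMs, value);
    }

    // Returns the value with the highest timestamp, regardless of write order.
    String get() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long sixHoursMs = 6L * 60 * 60 * 1000;

        VersionedCell infoServer = new VersionedCell();
        // Written earlier by a server with a correct clock.
        infoServer.put(now, "previous-server");
        // Written later by a server whose clock is 6 hours behind.
        infoServer.put(now - sixHoursMs, "new-server");

        // The later write carries the older timestamp, so it does not win.
        System.out.println(infoServer.get());
    }
}
```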
