hbase-user mailing list archives

From Jeff Whiting <je...@qualtrics.com>
Subject Re: Sanity date time check when a region server joins the cluster
Date Fri, 29 Oct 2010 17:58:05 GMT
Created HBASE-3168 for this issue.  It seems pretty straightforward and I wouldn't mind tackling
this problem.  How much of a skew do we want to allow between the RS and the rest of the cluster?
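
For illustration only, here is a minimal sketch of the kind of check being discussed: the master compares the time reported by the RS with its own and refuses the RS if the difference exceeds a threshold. Every name in it (ClockSkewCheck, ClockOutOfSyncException, checkClockSkew) and the 30 second threshold are assumptions made up for the example, not existing HBase code, and the right threshold is exactly the open question above.

```java
// Hypothetical sketch; none of these names are taken from the HBase source.
class ClockSkewCheck {

    /** Hypothetical exception, modeled on the "ClockOutOfSync-like" idea in this thread. */
    static class ClockOutOfSyncException extends Exception {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    /**
     * Rejects a region server whose reported clock differs from the master's
     * clock by more than maxSkewMs milliseconds, in either direction.
     */
    static void checkClockSkew(String serverName, long regionServerTimeMs,
                               long masterTimeMs, long maxSkewMs)
            throws ClockOutOfSyncException {
        long skewMs = Math.abs(masterTimeMs - regionServerTimeMs);
        if (skewMs > maxSkewMs) {
            throw new ClockOutOfSyncException("Server " + serverName
                + " rejected: clock skew of " + skewMs
                + " ms exceeds the allowed " + maxSkewMs + " ms");
        }
    }

    public static void main(String[] args) {
        long masterNow = System.currentTimeMillis();
        long maxSkewMs = 30_000L; // assumed threshold, chosen only for the example
        try {
            // One second behind the master: within the threshold.
            checkClockSkew("rs-ok", masterNow - 1_000L, masterNow, maxSkewMs);
            System.out.println("rs-ok accepted");
            // Six hours behind the master, as in the original report.
            checkClockSkew("rs-behind", masterNow - 6L * 60 * 60 * 1000,
                           masterNow, maxSkewMs);
            System.out.println("rs-behind accepted");
        } catch (ClockOutOfSyncException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In ServerManager a call like this would presumably sit next to the existing startup checks J-D lists below, with the threshold read from configuration rather than hard-coded.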


On 10/28/2010 12:08 PM, Jonathan Gray wrote:
> I was discussing this exact issue this morning.  Ran into a problem where master was
> timing out a region in transition because the RS was 5 minutes behind the master.
> I like the idea of the RS sending its timestamp on startup and if it is outside a certain
> threshold, the master throws it a ClockOutOfSync-like exception and the RS goes down.
> Please do file a jira, Jeff.  Or let me know and I can do it.
> JG
>> -----Original Message-----
>> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-
>> Daniel Cryans
>> Sent: Thursday, October 28, 2010 10:00 AM
>> To: user@hbase.apache.org
>> Subject: Re: Sanity date time check when a region server joins the
>> cluster
>> That could be done easily when the server checks in by looking at the
>> given start code. In ServerManager we already do:
>>      HServerInfo info = new HServerInfo(serverInfo);
>>      checkIsDead(info.getServerName(), "STARTUP");
>>      checkAlreadySameHostPort(info);
>>      recordNewServer(info, false, null);
>> A new check in there would fit nicely. Can you open a jira Jeff?
>> Thx!
>> J-D
>> On Thu, Oct 28, 2010 at 9:56 AM, Jeff Whiting<jeffw@qualtrics.com>
>> wrote:
>>> We recently had a problem where one of our machines in the cluster had a
>>> time that was 6 hours behind the other ones (ntp was supposed to be set up on
>>> that machine but wasn't).  We subsequently restarted our cluster and the
>>> '-ROOT-' table was assigned to that machine.  The problem was that when it
>>> tried to update the value (info:server) for who was holding the '.META.'
>>> table the value wasn't updating and stayed set as the previous machine. I'm
>>> pretty sure the problem was the timestamp for the new server was older than
>>> the timestamp for the previous server, preventing the value from updating
>>> correctly.  Having the incorrect info:server in the ROOT table basically
>>> made the cluster unusable.
>>> So my question is, would it make sense to have a sanity time check when a
>>> region server joins the cluster?  Basically when the region server joins it
>>> would send its current time and the master would check that time against its
>>> current time and if the difference is too large then it would prevent the region
>>> server from joining.  I know this is basic server configuration stuff but
>>> because of human error these things happen and seem like they can cause
>>> major problems for the cluster if the servers' times aren't synchronized.
>>> ~Jeff
>>> --
>>> Jeff Whiting
>>> Qualtrics Senior Software Engineer
>>> jeffw@qualtrics.com

Jeff Whiting
Qualtrics Senior Software Engineer
