From: Jonathan Gray
To: "user@hbase.apache.org"
Subject: RE: Sanity date time check when a region server joins the cluster
Date: Thu, 28 Oct 2010 18:08:13 +0000
Message-ID: <5A76F6CE309AD049AAF9A039A39242820F0B9BF1@sc-mbx04.TheFacebook.com>
References: <4CC9AB49.2080506@qualtrics.com>

I was discussing this exact issue this morning. Ran into a problem where
the master was timing out a region in transition because the RS was 5
minutes behind the master.

I like the idea of the RS sending its timestamp on startup and, if it is
outside a certain threshold, the master throws it a ClockOutOfSync-like
exception and the RS goes down.

Please do file a jira, Jeff. Or let me know and I can do it.

JG

> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
> Jean-Daniel Cryans
> Sent: Thursday, October 28, 2010 10:00 AM
> To: user@hbase.apache.org
> Subject: Re: Sanity date time check when a region server joins the
> cluster
>
> That could be done easily when the server checks in by looking at the
> given start code. In ServerManager we already do:
>
>     HServerInfo info = new HServerInfo(serverInfo);
>     checkIsDead(info.getServerName(), "STARTUP");
>     checkAlreadySameHostPort(info);
>     recordNewServer(info, false, null);
>
> A new check in there would fit nicely. Can you open a jira Jeff?
>
> Thx!
>
> J-D
>
> On Thu, Oct 28, 2010 at 9:56 AM, Jeff Whiting wrote:
> > We recently had a problem where one of our machines in the cluster
> > had a time that was 6 hours behind the other ones (ntp was supposed
> > to be set up on that machine but wasn't). We subsequently restarted
> > our cluster and the '-ROOT-' table was assigned to that machine. The
> > problem was that when it tried to update the value (info:server) for
> > who was holding the '.META.' table the value wasn't updating and
> > stayed set as the previous machine.
> > I'm pretty sure the problem was that the timestamp for the new
> > server was older than the timestamp for the previous server,
> > preventing the value from updating correctly. Having the incorrect
> > info:server in the ROOT table basically made the cluster unusable.
> >
> > So my question is, would it make sense to have a sanity time check
> > when a region server joins the cluster? Basically when the region
> > server joins it would send its current time and the master would
> > check that time against its current time, and if the difference is
> > too large then it would prevent the region server from joining. I
> > know this is basic server configuration stuff but because of human
> > error these things happen and seem like they can cause major
> > problems for the cluster if the servers' times aren't synchronized.
> >
> > ~Jeff
> >
> > --
> >
> > Jeff Whiting
> > Qualtrics Senior Software Engineer
> > jeffw@qualtrics.com
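
For illustration, here is a minimal sketch of the check proposed in this thread: the region server reports its current time when it checks in, the master compares it with its own clock, and the server is rejected if the difference exceeds a threshold. Everything here is an assumption made for the example and not existing HBase code: the class ClockSkewCheck, the nested ClockOutOfSyncException, the method names, and the 30 second default are all hypothetical. The sketch also ignores network latency between the two time readings.

```java
// Hypothetical sketch of a startup clock-skew check; not HBase code.
class ClockSkewCheck {

    /** Hypothetical exception thrown when a joining server's clock is too far off. */
    static class ClockOutOfSyncException extends Exception {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    /** Assumed default threshold; a real check would likely make this configurable. */
    static final long DEFAULT_MAX_SKEW_MILLIS = 30_000L;

    /** Absolute difference between the master's and the server's clocks, in ms. */
    static long skewMillis(long masterTimeMillis, long serverTimeMillis) {
        return Math.abs(masterTimeMillis - serverTimeMillis);
    }

    /**
     * Rejects the server if its reported time differs from the master's
     * time by more than maxSkewMillis; otherwise returns normally.
     */
    static void checkClockSkew(String serverName, long serverTimeMillis,
                               long masterTimeMillis, long maxSkewMillis)
            throws ClockOutOfSyncException {
        long skew = skewMillis(masterTimeMillis, serverTimeMillis);
        if (skew > maxSkewMillis) {
            throw new ClockOutOfSyncException("Server " + serverName
                    + " rejected: clock skew " + skew + "ms exceeds max "
                    + maxSkewMillis + "ms");
        }
    }

    public static void main(String[] args) {
        long masterNow = System.currentTimeMillis();
        // Simulate the case from this thread: a server 6 hours behind.
        long sixHoursBehind = masterNow - 6L * 60 * 60 * 1000;
        try {
            checkClockSkew("rs-example", sixHoursBehind, masterNow,
                    DEFAULT_MAX_SKEW_MILLIS);
            System.out.println("server accepted");
        } catch (ClockOutOfSyncException e) {
            System.out.println("server rejected: " + e.getMessage());
        }
    }
}
```

In a real master, a call like checkClockSkew would presumably sit next to the existing startup checks J-D quoted from ServerManager, and the region server would shut down on receiving the exception.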