hbase-issues mailing list archives

From "Jeff Whiting (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-3168) Sanity date and time check when a region server joins the cluster
Date Mon, 01 Nov 2010 20:06:24 GMT

     [ https://issues.apache.org/jira/browse/HBASE-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Whiting updated HBASE-3168:
--------------------------------

    Attachment: HBASE-3168-trunk-v1.txt

This is my first attempt at a fix for this issue.  I wrote a unit test, and I think the fundamental
idea is solid.  However, I have some concerns.

#1 HServerInfo is instantiated in the HRegionServer constructor with the current time as the
startCode.  However, the serverInfo isn't sent to the master until reportForDuty() is called.
I don't know how much time elapses between the constructor and the reportForDuty() call.

#2 It seems like there could be problems in various failure scenarios.  Let's say the
master dies and is then restarted (or the backup takes its place).  If each region server
then calls reportForDuty(), the startCode could be really old and the server wouldn't be
allowed to join.  In effect, any situation where the region server has to (re)join the master
after it has been running a while would cause the region server to be rejected, because the
startCode is the time when it was started, not the time when it tried to join.

I'm not super familiar with HBase, so some of my concerns may not be valid, and there may be
others I don't know about.  The basic problem as I see it is the startCode getting
stale and the master rejecting the server when it shouldn't.
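
To make the concern concrete, here is a minimal standalone sketch (not the attached patch) of
a skew check that compares the master's clock against a timestamp sampled when the server
reports for duty, rather than against the startCode.  Every name in it (ClockSkewCheck,
checkClockSkew, ClockOutOfSyncException, the maxSkewMs parameter) is hypothetical and only
illustrates the idea; it also ignores network delay between the region server and the master.

```java
// Hypothetical sketch: none of these names come from the attached patch or from HBase itself.
public class ClockSkewCheck {

    /** Hypothetical stand-in for the "ClockOutOfSync-like exception" suggested on the issue. */
    static class ClockOutOfSyncException extends Exception {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    /**
     * Rejects a server whose reported time differs from the master's time by more than maxSkewMs.
     * Assumption: reportedTimeMs is sampled when the server reports for duty, not at startup.
     */
    static void checkClockSkew(String serverName, long reportedTimeMs, long masterTimeMs,
                               long maxSkewMs) throws ClockOutOfSyncException {
        long skewMs = Math.abs(masterTimeMs - reportedTimeMs);
        if (skewMs > maxSkewMs) {
            throw new ClockOutOfSyncException("Server " + serverName + " rejected: clock skew of "
                + skewMs + " ms exceeds the allowed " + maxSkewMs + " ms");
        }
    }

    public static void main(String[] args) {
        long maxSkewMs = 30_000L;                  // assumed threshold, purely illustrative
        long masterNow = System.currentTimeMillis();
        long startCode = masterNow - 3_600_000L;   // server was started an hour ago
        long reportTime = masterNow;               // clocks agree at (re)join time

        try {
            // A timestamp taken at report time passes even for a long-running server.
            checkClockSkew("rs-example", reportTime, masterNow, maxSkewMs);
            System.out.println("fresh timestamp accepted");
            // Reusing the startCode makes the same healthy server look skewed.
            checkClockSkew("rs-example", startCode, masterNow, maxSkewMs);
            System.out.println("startCode accepted");
        } catch (ClockOutOfSyncException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check shaped like this, a server that rejoins after a master restart would be judged on
its clock at that moment, so a stale startCode would not by itself cause a rejection.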


> Sanity date and time check when a region server joins the cluster
> -----------------------------------------------------------------
>
>                 Key: HBASE-3168
>                 URL: https://issues.apache.org/jira/browse/HBASE-3168
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 0.89.20100924
>         Environment: RHEL 5.5 64bit, 1 Master 4 Region Servers
>            Reporter: Jeff Whiting
>         Attachments: HBASE-3168-trunk-v1.txt
>
>
> Introduce a sanity check when a RS joins the cluster to make sure its clock isn't too
> far out of sync with the rest of the cluster.  If the RS's clock skew is too large, the
> master would prevent it from joining, and the RS would log the error and die.
> Having a RS with even small differences in time can cause huge problems due to how HBase
> stores values with timestamps.
> According to J-D, in ServerManager we are already doing:
> {code}
>     HServerInfo info = new HServerInfo(serverInfo);
>     checkIsDead(info.getServerName(), "STARTUP");
>     checkAlreadySameHostPort(info);
>     recordNewServer(info, false, null);
> {code}
> He says the new check would fit in nicely there.
> JG suggests we add a "ClockOutOfSync-like exception".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

