kudu-user mailing list archives

From Franco Venturi <fvent...@comcast.net>
Subject Error message: 'Tried to update clock beyond the max. error.'
Date Wed, 01 Nov 2017 02:12:55 GMT


A few days ago at work our Kudu servers started having fatal errors and shutting down with
the following error message: 




Couldn't get the current time: Clock unsynchronized. Status: Service unavailable: Error: Clock
synchronized but error was too high (10000016 us).




After some research in the community forums, I found a post by Todd that pointed to this JIRA
issue: https://issues.apache.org/jira/browse/KUDU-2079 

I then checked our ntpd configuration and sure enough we had the '-x' option in the daemon
command, so I went ahead, removed that option, restarted ntpd, and a few minutes later I restarted
all the Kudu processes (one master and three tablet servers). 
A few minutes later a couple of those Kudu processes were down again, this time with this
new time sync related error message: 




Tried to update clock beyond the max. error. 




To try to address this new error, I brought down all the Kudu processes, stopped ntpd, resync'd
the time on all the servers with ntpdate, brought ntpd back up, waited a bit, and restarted
Kudu (master and tablet servers). Within a few minutes, a couple of them were down again
with the same 'Tried to update clock beyond the max. error.' message.




I eventually ended up doubling the parameter 'max_clock_sync_error_usec' to 20,000,000 (20
seconds) and everything stayed up (and is still up). 




Looking at the source code in git, I found the relevant section here (source file https://github.com/apache/kudu/blob/master/src/kudu/clock/hybrid_clock.cc):





// we won't update our clock if to_update is more than 'max_clock_sync_error_usec'
// into the future as it might have been corrupted or originated from an out-of-sync
// server.
if ((to_update_physical - now_physical) > FLAGS_max_clock_sync_error_usec) {
  return Status::InvalidArgument("Tried to update clock beyond the max. error.");
}




If I understand this code correctly, it is complaining because for some reason Kudu is trying
to update its clock by more than 10 seconds. However, I ran ntptime and several ntpq queries,
and I don't see the time between the servers being off by that much (or even by, say, half a
second, since they are all synchronized with a stratum 3 NTP server).
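
To make that reading concrete, here is a minimal standalone sketch of the comparison in the
snippet above. It is not Kudu's actual code: the function name CheckClockUpdate, the constant
kDefaultMaxClockSyncErrorUsec, and the use of plain microsecond integers are assumptions made
for illustration. The 10,000,000 us default is inferred from the doubling to 20,000,000
described above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for FLAGS_max_clock_sync_error_usec. The 10-second
// value is inferred from the doubling to 20,000,000 us described in the message.
constexpr int64_t kDefaultMaxClockSyncErrorUsec = 10000000;

// Hypothetical helper mirroring the quoted check. Both arguments are physical
// times in microseconds. Returns an empty string if the update would be
// accepted, or the error text if the incoming timestamp lies too far ahead
// of the local physical clock.
std::string CheckClockUpdate(int64_t to_update_physical_usec,
                             int64_t now_physical_usec,
                             int64_t max_error_usec = kDefaultMaxClockSyncErrorUsec) {
  // Only timestamps in the future can trip the check; a timestamp at or
  // behind the local clock yields a non-positive difference and passes.
  if ((to_update_physical_usec - now_physical_usec) > max_error_usec) {
    return "Tried to update clock beyond the max. error.";
  }
  return "";
}
```

Under these assumptions, a timestamp 10,000,001 us ahead of the local clock is rejected with
the default limit but accepted once the limit is raised to 20,000,000 us, which matches the
behavior after the flag change. The check compares an incoming timestamp against the local
clock, so it is not a direct measure of the NTP offset that ntptime or ntpq report.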




Has anyone in this group seen anything similar or does anyone have a better understanding
of what this message means and what could be causing it? 




Thanks, 
Franco 
