kudu-user mailing list archives

From Alexey Serbin <aser...@cloudera.com>
Subject Re: tserver died by clock unsync.
Date Fri, 16 Jun 2017 19:53:40 GMT
Hi Jason,

I think the workaround you mentioned (i.e. replacing LOG(FATAL) with
LOG(WARNING) in the cited code snippet) is not safe at all.  If
ntp_gettime() returns the TIME_ERROR code, the 'now_usec' variable
might be left uninitialized, and any code relying on the
HybridClock::NowWithError() method would get garbage instead of the
wall clock value in microseconds.  That might lead to serious issues
elsewhere up the chain, and it's hard to predict what would happen.  If
you are lucky, the tserver will simply crash later on; if not, you'll
get undefined behavior and data corruption which would be very hard to
track down and fix.
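
To illustrate the concern: when a clock read fails, its out-parameters
cannot be trusted, so the failure has to stop the caller from using
them.  Below is a minimal sketch of that pattern.  The names WallTime,
ClockSource and ReadWallTime are hypothetical and are not part of Kudu;
the sketch only shows the general idea, not how HybridClock is
implemented.

```cpp
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical type for illustration only; not a Kudu class.
struct WallTime {
  uint64_t now_usec;
  uint64_t error_usec;
};

// A clock source writes both outputs and returns true on success.
// On failure it returns false and may leave the outputs untouched.
using ClockSource =
    std::function<bool(uint64_t* now_usec, uint64_t* error_usec)>;

// Returns a reading only when the source reported success, so the caller
// can never observe values that the source did not actually write.
std::optional<WallTime> ReadWallTime(const ClockSource& source) {
  uint64_t now_usec = 0;
  uint64_t error_usec = 0;
  if (!source(&now_usec, &error_usec)) {
    return std::nullopt;  // Propagate the failure instead of a bogus time.
  }
  return WallTime{now_usec, error_usec};
}
```

A caller then has to check the result (e.g. with has_value()) before
reading the time.  Merely logging a warning and carrying on would skip
exactly that check.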

Instead of running your tservers with that unsafe change, I would
recommend tracking down the issue with NTP in your cluster.  Make sure
nothing besides ntpd sets the clock on your machines (e.g., make sure
nobody runs ntpdate manually and ntpdate is not executed by a cron job,
etc.).  If your local network loses internet connectivity for long
periods of time, one suggestion would be to run an NTP server on a
stable machine (or two) within your local network.  Your local NTP
servers would source time from 5-7 public NTP servers of stratum 2 or 3
on the internet.  In turn, the NTP daemons on your Kudu nodes would use
your internal NTP server(s) as their source.  Also, it would make sense
to take a look at some 'NTP best practice' guides you can find on the
Internet; hopefully they will give you some ideas on how to tailor the
setup for your case.
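
While tracking the NTP issue down, it may help to watch the same kernel
state that ntp_gettime() reports.  Below is a small sketch, assuming
Linux with glibc (which provides ntp_gettime() in <sys/timex.h>).  The
helper names DescribeClockState and QueryClockState are made up for
this example.

```cpp
#include <sys/timex.h>  // ntp_gettime(), struct ntptimeval, TIME_* codes

#include <string>

// Maps an ntp_gettime() return code to a short description.
// Hypothetical helper for illustration; not part of Kudu.
std::string DescribeClockState(int rc) {
  switch (rc) {
    case TIME_OK:
      return "synchronized";
    case TIME_ERROR:
      return "unsynchronized";
    case -1:
      return "call failed";
    default:
      // Remaining codes (TIME_INS, TIME_DEL, TIME_OOP, TIME_WAIT)
      // relate to leap second handling.
      return "leap-second state";
  }
}

// Queries the kernel clock state.  If max_error_usec is not null, it
// receives the kernel's maximum error estimate in microseconds.
std::string QueryClockState(long* max_error_usec) {
  struct ntptimeval tv = {};  // Zero-initialized in case the call fails.
  const int rc = ntp_gettime(&tv);
  if (max_error_usec != nullptr) {
    *max_error_usec = tv.maxerror;
  }
  return DescribeClockState(rc);
}
```

Calling QueryClockState() periodically on a node and logging the result
is one way to see when the clock becomes unsynchronized, and whether
that lines up with ntpdate runs or network outages.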

Hope this helps.


Kind regards,

Alexey


On 6/16/17 1:59 AM, Jason Heo wrote:
> Hi.
>
> Congratulations on the Apache Kudu 1.4.0 release.
>
> To prevent the tserver from dying unexpectedly, I've changed LOG(FATAL)
> <https://github.com/apache/kudu/blob/1.4.0/src/kudu/server/hybrid_clock.cc#L227>
> to LOG(WARNING).
>
> I wanted to know whether it is safe to continue if ntp_gettime() in
> GetClockTime
> <https://github.com/apache/kudu/blob/1.4.0/src/kudu/server/hybrid_clock.cc#L90>
> returns TIME_ERROR.
>
> Could anyone help me?
>
> Regards,
>
> Jason
>
>
>
> 2017-06-15 12:40 GMT+09:00 Jason Heo <jason.heo.sde@gmail.com 
> <mailto:jason.heo.sde@gmail.com>>:
>
>     Hi,
>
>     I'm using Apache Kudu 1.4.0
>
>     Yesterday, 6 tservers died at the same time. The following message
>     was logged for each tserver.
>
>
>     F0614 14:58:32.868551 111454 hybrid_clock.cc:227]
>     Couldn't get the current time: Clock unsynchronized.
>     Status: Service unavailable:
>     Error reading clock. Clock considered unsynchronized
>
>
>     We are already using ntpd, and the following NTP-related message
>     was logged in /var/log/messages.
>
>     Jun 14 14:58:38 hostname ntpdate[10231]: step time server ip_addr
>     offset -0.000168 sec
>
>
>     We use our own NTP service. I don't know the exact reason,
>     but I suspect that our NTP service malfunctioned or the
>     network was temporarily unreliable.
>
>     The problem is that this could happen again and again.
>
>     So, I'm considering changing LOG(FATAL) to LOG(WARNING) in the
>     Kudu source code so that the tserver does not exit when the clock
>     is unsynchronized.
>
>       uint64_t now_usec;
>       uint64_t error_usec;
>       Status s = WalltimeWithError(&now_usec, &error_usec);
>       if (PREDICT_FALSE(!s.ok())) {
>         LOG(FATAL) << Substitute("Couldn't get the current time: Clock unsynchronized. "
>                                  "Status: $0", s.ToString());
>       }
>
>
>
>     So, my question is: is it OK to change LOG(FATAL) to LOG(WARNING)
>     in the above code, and would this prevent the tserver from dying
>     when the clock is unsynchronized?
>
>     Thanks.
>
>     Regards,
>
>     Jason
>
>

