kudu-user mailing list archives

From Jason Heo <jason.heo....@gmail.com>
Subject Re: tserver died by clock unsync.
Date Sat, 17 Jun 2017 06:15:34 GMT
Hi Alexey,

Thank you for your kind answer!

Best,

Jason

2017-06-17 4:53 GMT+09:00 Alexey Serbin <aserbin@cloudera.com>:

> Hi Jason,
>
> I think the workaround you mentioned (i.e. replacing LOG(FATAL) with
> LOG(WARNING) in the cited code snippet) is not safe at all.  If
> ntp_gettime() returns the TIME_ERROR code, the 'now_usec' variable
> might be left uninitialized, and code relying on the
> HybridClock::NowWithError() method would get garbage instead of the
> wall clock usec value.  That might lead to serious issues further up
> the chain, and it's hard to predict what would happen.  If you are
> lucky, the tserver will simply crash later on; if not, you'll get
> undefined behavior and data corruption that would be very hard to
> track down and fix.
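
(For illustration, a minimal standalone sketch of the hazard described above;
the Status stand-in and the failing WalltimeWithError below are simplified
placeholders, not Kudu's actual classes.)

    // Sketch only: why ignoring a failed clock read is unsafe.
    #include <cstdint>
    #include <iostream>

    // Simplified stand-in for Kudu's Status (the real one lives in
    // kudu/util/status.h).
    struct Status {
      bool ok_value;
      bool ok() const { return ok_value; }
    };

    // Models the failure path: when the clock read fails, the output
    // parameters are never written.
    Status WalltimeWithError(uint64_t* now_usec, uint64_t* error_usec) {
      (void)now_usec;
      (void)error_usec;
      return Status{false};  // pretend ntp_gettime() returned TIME_ERROR
    }

    int main() {
      uint64_t now_usec;    // uninitialized
      uint64_t error_usec;  // uninitialized
      Status s = WalltimeWithError(&now_usec, &error_usec);
      if (!s.ok()) {
        // With LOG(FATAL) demoted to LOG(WARNING), execution falls through
        // and now_usec still holds an indeterminate value.
        std::cerr << "WARNING: clock unsynchronized, continuing anyway\n";
      }
      // Every timestamp derived from this value is garbage; in C++,
      // reading an indeterminate value is undefined behavior.
      std::cout << "timestamp: " << now_usec << "\n";
      return 0;
    }
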
>
> Instead of running your tservers with that unsafe change, I would
> recommend tracking down the NTP issue in your cluster.  Make sure
> nothing else adjusts the clock on your machines besides ntpd (e.g.,
> make sure nobody runs ntpdate manually and ntpdate is not executed by
> a cron job, etc.).  If your local network experiences internet outages
> for long periods of time, one suggestion might be to run an NTP server
> on a stable machine (or two) within your local network.  Your local
> NTP servers would source time from 5-7 public NTP servers of stratum 2
> or 3 on the internet.  In turn, the NTP daemons on your Kudu nodes
> would use your internal NTP server(s) as their source.  Also, it would
> make sense to take a look at some of the 'NTP best practice' guides
> you can find on the Internet -- hopefully they will give you some
> ideas on how to tailor those recommendations to your case.
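
(For reference, a rough sketch of what /etc/ntp.conf on such an internal
time server might contain; the pool hostnames, subnet, and internal server
names below are placeholders to adapt, not specific recommendations.)

    # /etc/ntp.conf on the internal time server (hostnames/subnet are placeholders)
    # Source time from several public NTP servers.
    server 0.pool.ntp.org iburst
    server 1.pool.ntp.org iburst
    server 2.pool.ntp.org iburst
    server 3.pool.ntp.org iburst

    # Allow the Kudu nodes' subnet to query this server, but not modify it.
    restrict 192.168.0.0 mask 255.255.255.0 nomodify notrap

    # On each Kudu node, ntp.conf would instead point at the internal server(s):
    #   server internal-ntp-1.example.com iburst
    #   server internal-ntp-2.example.com iburst
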
>
> Hope this helps.
>
>
> Kind regards,
>
> Alexey
>
>
> On 6/16/17 1:59 AM, Jason Heo wrote:
>
>> Hi.
>>
>> Congrats on Apache Kudu 1.4.0!
>>
>> To prevent tservers from dying unexpectedly, I've changed LOG(FATAL) <
>> https://github.com/apache/kudu/blob/1.4.0/src/kudu/server/hybrid_clock.cc#L227>
>> to LOG(WARNING).
>>
>> I wanted to know whether it is safe to continue if ntp_gettime() in
>> GetClockTime <
>> https://github.com/apache/kudu/blob/1.4.0/src/kudu/server/hybrid_clock.cc#L90>
>> returns TIME_ERROR.
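
(For context, a minimal sketch of what that kernel call reports; this is not
Kudu's GetClockTime, just a standalone check of ntp_gettime() and the
TIME_ERROR state it can return on Linux/glibc.)

    // Sketch: query the kernel NTP state directly.
    #include <sys/timex.h>
    #include <cstdio>

    int main() {
      struct ntptimeval ntv;
      int state = ntp_gettime(&ntv);
      if (state == TIME_ERROR) {
        // The kernel considers the clock unsynchronized; neither the
        // timestamp nor the error bound can be trusted here.
        std::printf("clock unsynchronized (TIME_ERROR)\n");
        return 1;
      }
      std::printf("time: %ld.%06ld, maxerror: %ld us\n",
                  static_cast<long>(ntv.time.tv_sec),
                  static_cast<long>(ntv.time.tv_usec),
                  ntv.maxerror);
      return 0;
    }
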
>>
>> Could anyone help me?
>>
>> Regards,
>>
>> Jason
>>
>>
>>
>> 2017-06-15 12:40 GMT+09:00 Jason Heo <jason.heo.sde@gmail.com>:
>>
>>
>>     Hi,
>>
>>     I'm using Apache Kudu 1.4.0
>>
>>     Yesterday, 6 tservers died at the same time. The following message
>>     was logged on each tserver.
>>
>>
>>     F0614 14:58:32.868551 111454 hybrid_clock.cc:227] Couldn't get the
>>     current time: Clock unsynchronized. Status: Service unavailable:
>>     Error reading clock. Clock considered unsynchronized
>>
>>
>>     We are already using ntpd, and the following NTP-related message is
>>     logged in /var/log/messages:
>>
>>     Jun 14 14:58:38 hostname ntpdate[10231]: step time server ip_addr
>>     offset -0.000168 sec
>>
>>
>>     We run our own NTP service. I don't know the exact reason, but I
>>     suspect our NTP service malfunctioned or the network was
>>     temporarily unstable.
>>
>>     The problem is that this could happen again and again.
>>
>>     So, I'm considering changing LOG(FATAL) to LOG(WARNING) in the Kudu
>>     source code so that the tserver does not exit when the clock is
>>     unsynchronized:
>>
>>       uint64_t now_usec;
>>       uint64_t error_usec;
>>       Status s = WalltimeWithError(&now_usec, &error_usec);
>>       if (PREDICT_FALSE(!s.ok())) {
>>         LOG(FATAL) << Substitute("Couldn't get the current time: Clock unsynchronized. "
>>                                  "Status: $0", s.ToString());
>>       }
>>
>>
>>
>>     So, my question is: is it OK to change LOG(FATAL) to LOG(WARNING)
>>     in the code above? And would that prevent the tserver from dying
>>     when the clock is unsynchronized?
>>
>>     Thanks.
>>
>>     Regards,
>>
>>     Jason
>>
>>
>>
>
