kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: kudu master crashes
Date Wed, 12 Oct 2016 04:36:58 GMT
Hey Darren,

I agree it would be nice to "freeze" rather than fully exit in the case
that NTP has some temporary issues. Actually Song Zhang already filed this
request a little while back if you want to follow it:

https://issues.apache.org/jira/browse/KUDU-1578

One thing that's worth noting about NTP is that adding extra time sources
won't decrease the quality of your synchronization, even if they are "far
away" ones (e.g. the public NTP pool). NTP uses an algorithm called Marzullo's
Algorithm[1] to find the best estimate of time among the possible sources.
So even if you just run one server internally, you could add a few public
ones on top to give some extra redundancy.

-Todd

[1] https://en.wikipedia.org/wiki/Marzullo%27s_algorithm
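
To make the idea concrete, here is a minimal sketch of the basic form of
Marzullo's algorithm as described at [1]. Each source reports a time offset as
an interval (lower bound, upper bound), and the algorithm returns the interval
that the largest number of sources agree on. The function name and the sample
intervals are made up for illustration; as far as I know ntpd itself uses a
refined variant of this, so treat it as an illustration of the principle rather
than what ntpd actually runs.

```python
def marzullo(intervals):
    """Return (count, lo, hi): the interval contained in the most source intervals.

    `intervals` is a list of (lo, hi) pairs, one per time source.
    """
    # Each interval contributes two edge events: -1 marks a start, +1 marks an end.
    events = []
    for lo, hi in intervals:
        events.append((lo, -1))
        events.append((hi, +1))
    # Sorting tuples puts a start before an end at the same offset, so
    # intervals that merely touch are treated as overlapping.
    events.sort()

    best = count = 0
    best_lo = best_hi = None
    for i, (offset, kind) in enumerate(events):
        count -= kind  # a start raises the overlap count, an end lowers it
        if count > best:
            # Only a start can raise the count, so a later event always exists.
            best = count
            best_lo = offset
            best_hi = events[i + 1][0]
    return best, best_lo, best_hi


# Three sources that all overlap: 10 +/- 2, 12 +/- 1, 11 +/- 1
print(marzullo([(8, 12), (11, 13), (10, 12)]))   # (3, 11, 12)

# One source far off from the others: it is simply outvoted
print(marzullo([(8, 12), (11, 13), (14, 15)]))   # (2, 11, 12)
```

The second call shows why an extra, less accurate source does no harm: a source
that disagrees with the majority does not widen the chosen interval, it just
fails to contribute to it.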



On Tue, Oct 11, 2016 at 11:24 AM, Darren Hoo <darren.hoo@gmail.com> wrote:

> Hi Todd,
>
> Thanks for the info.
>
> kudu master will refuse to start if the clock is out of sync, but will kudu
> master exit abruptly if the clock drifts while kudu master is running?
>
> We have only one NTP server running, and all other nodes in the cluster
> are synchronized to this server. I shall check the NTP manuals
> and set up multiple NTP servers.
>
> On Wed, Oct 12, 2016 at 12:12 AM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> Hi Darren,
>>
>> It sounds like the server must have briefly lost NTP synchronization. As
>> far as I know, Cloudera Manager's alert doesn't check for the status as
>> reported by ntptime, but rather checks that the agent and the CM master
>> have relatively close clocks. Even if the kudu server lost sync for a
>> couple minutes, the clock probably didn't drift enough to trigger CM's
>> warning.
>>
>> Do you already have multiple NTP servers configured in your ntp
>> configuration? That's usually helpful for better redundancy.
>>
>> -Todd
>>
>>
>>
>> On Tue, Oct 11, 2016 at 1:01 AM, Darren Hoo <darren.hoo@gmail.com> wrote:
>>
>>> It seems that it's caused by this:
>>>
>>> + exec /opt/cloudera/parcels/KUDU-1.0.0-1.kudu1.0.0.p0.6/lib/kudu/sbin/kudu-master
>>> --master_addresses=nm-new,snm-new
>>> --flagfile=/var/run/cloudera-scm-agent/process/11172-kudu-KUDU_MASTER/gflagfile
>>>
>>> F1011 14:35:27.318984 32076 hybrid_clock.cc:227] Couldn't get the
>>> current time: Clock unsynchronized. Status: Service unavailable: Error
>>> reading clock. Clock considered unsynchronized
>>>
>>> *** Check failure stack trace: ***
>>>     @           0x7e6a2d  google::LogMessage::Fail()
>>>     @           0x7e892d  google::LogMessage::SendToLog()
>>>     @           0x7e6569  google::LogMessage::Flush()
>>>     @           0x7e93cf  google::LogMessageFatal::~LogMessageFatal()
>>>     @           0xa1a56e  kudu::server::HybridClock::NowWithError()
>>>     @           0xa1b973  kudu::server::HybridClock::NowForMetrics()
>>>     @           0x85f8f4  kudu::FunctionGauge<>::WriteValue()
>>>     @          0x1916700  kudu::Gauge::WriteAsJson()
>>>     @          0x1917d65  kudu::MetricEntity::WriteAsJson()
>>>     @          0x1919271  kudu::MetricRegistry::WriteAsJson()
>>>     @           0x995721  (unknown)
>>>     @           0x98e5f6  kudu::Webserver::RunPathHandler()
>>>     @           0x98f171  kudu::Webserver::BeginRequestCallbackStatic()
>>>     @           0x9b2f6e  (unknown)
>>>     @           0x9b586e  (unknown)
>>>     @           0x9b5f0c  (unknown)
>>>     @     0x7f4813ea1aa1  start_thread
>>>     @     0x7f4812c12aad  clone
>>>     @              (nil)  (unknown)
>>>
>>>
>>>
>>> *but ntptime shows OK:*
>>>
>>>
>>> ntp_gettime() returns code 0 (OK)
>>>   time dba7194c.f8dbce6c  Tue, Oct 11 2016 15:54:52.972, (.972104188),
>>>   maximum error 471276 us, estimated error 11 us, TAI offset 0
>>> ntp_adjtime() returns code 0 (OK)
>>>   modes 0x0 (),
>>>   offset -10.130 us, frequency 44.000 ppm, interval 1 s,
>>>   maximum error 471276 us, estimated error 11 us,
>>>   status 0x2001 (PLL,NANO),
>>>   time constant 7, precision 0.001 us, tolerance 500 ppm
>>>
>>>
>>>
>>> *And there are no NTP unsynchronized warnings in Cloudera Manager.*
>>>
>>>
>>>
>>> On Tue, Oct 11, 2016 at 3:29 PM, Darren Hoo <darren.hoo@gmail.com>
>>> wrote:
>>>
>>>> kudu master seldom crashes, but starting yesterday, one of our
>>>> two kudu masters has been crashing very often.
>>>>
>>>> Can anyone help to see what's going on?
>>>>
>>>> You can get the core file here: http://167.88.124.211:8000/core.22459.xz
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
