impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matthew Jacobs>
Subject Re: Kudu start error with low ntpdate "maximum error"
Date Wed, 16 Nov 2016 19:48:33 GMT
Yeah, I don't doubt that the ntp output when you checked it was wrong,
but it may be different from the time when Kudu failed to start, since
that was probably at least several minutes later (assuming you had to
find out about the failed job, ssh into the machine, discover the ntp
warning and then check it at that time).

I agree adding some logging at the beginning of jobs is the best thing
to do now.


On Wed, Nov 16, 2016 at 11:38 AM, Jim Apple <> wrote:
> I found a place where I suppose that could go
> (testdata/cluster/node_templates/cdh5/etc/kudu/), but I'm not sure
> what I want to set it at. If "maximum error 197431 us" is to be
> believed, it needs to be at least 0.2 seconds, and if
> is to be believed,
> it is already 10 seconds.
> I added some more logging to try and see what is going on, and if I
> hit this again I'll re-open the thread. For now, I am not planning to
> submit a patch to "fix" it because I'm not sure increasing the number
> is the real solution.
> On Wed, Nov 16, 2016 at 11:00 AM, Matthew Jacobs <> wrote:
>> According to the error message, it looks like we can specify the
>> '--max_clock_sync_error_usec' flag when starting the Kudu processes.
>> We may want to start by printing ntptime output at the beginning of
>> jobs so we can see how far off it is. If it's off by days then maybe
>> changing the error isn't a good idea, and we'll need to figure out
>> something else.
>> On Wed, Nov 16, 2016 at 10:55 AM, Jim Apple <> wrote:
>>> How do we bump up the allowable error?
>>> On Wed, Nov 16, 2016 at 10:20 AM, Matthew Jacobs <> wrote:
>>>> I asked on the Kudu slack channel, they have seen issues where freshly
>>>> provisioned ec2 nodes take some time for ntp to quiesce, but they
>>>> didn't have a sense of how long that might take. If you checked
>>>> ntptime after the job failed, it may be that ntp had enough time. We
>>>> can probably consider bumping up the allowable error.
>>>> On Wed, Nov 16, 2016 at 9:24 AM, Jim Apple <> wrote:
>>>>> This is the second time I have seen it, but it doesn't happen every
>>>>> time. It could very well be a difference on ec2; already I've seen
>>>>> some bugs due to my ec2 instances being Etc/UTC timezone while most
>>>>> Impala developers work in America/Los_Angeles.
>>>>> On Wed, Nov 16, 2016 at 9:10 AM, Matthew Jacobs <>
>>>>>> No problem. If this happens again we should ask the Kudu developers.
>>>>>> haven't seen this before - I wonder if it could be some weirdness
>>>>>> ec2...
>>>>>> Thanks
>>>>>> On Wed, Nov 16, 2016 at 9:01 AM, Jim Apple <>
>>>>>>> Thank you for your help!
>>>>>>> This was on an AWS machine that has expired, but I can see from
>>>>>>> logs that "IMPALA_KUDU_VERSION=88b023" and
>>>>>>> "KUDU_JAVA_VERSION=1.0.0-SNAPSHOT" and "Downloading
>>>>>>> kudu-python-0.3.0.tar.gz" and "URL
>>>>>>> I'll add "ps aux | grep kudu" to the logging this machine does
>>>>>>> error, so we'll have it next time, but I did "ps -Afly" on exit
>>>>>>> there were no kudu processes running, it looks like.
>>>>>>> On Wed, Nov 16, 2016 at 8:52 AM, Matthew Jacobs <>
>>>>>>>> Can you check which version of the client you're building
>>>>>>>> (KUDU_VERSION env var) vs what Kudu version is running (ps
aux | grep
>>>>>>>> kudu
>>>>>>>> On Wed, Nov 16, 2016 at 8:48 AM, Jim Apple <>
>>>>>>>>> Yes.
>>>>>>>>> On Wed, Nov 16, 2016 at 7:45 AM, Matthew Jacobs <>
>>>>>>>>>> Do you have NTP installed?
>>>>>>>>>> On Tue, Nov 15, 2016 at 9:22 PM, Jim Apple <>
>>>>>>>>>>> I have a machine where Kudu failed to start:
>>>>>>>>>>> F1116 05:02:00.173629 71098]
Check failed:
>>>>>>>>>>> _s.ok() Bad status: Service unavailable: Cannot
initialize clock:
>>>>>>>>>>> Error reading clock. Clock considered unsynchronized
>>>>>>>>>>> "For the master and tablet server daemons, the
server’s clock must be
>>>>>>>>>>> synchronized using NTP. In addition, the maximum
clock error (not to
>>>>>>>>>>> be mistaken with the estimated error) be below
a configurable
>>>>>>>>>>> threshold. The default value is 10 seconds, but
it can be set with the
>>>>>>>>>>> flag --max_clock_sync_error_usec."
>>>>>>>>>>> and
>>>>>>>>>>> "If NTP is installed the user can monitor the
synchronization status
>>>>>>>>>>> by running ntptime. The relevant value is what
is reported for maximum
>>>>>>>>>>> error."
>>>>>>>>>>> ntptime reports:
>>>>>>>>>>> ntp_gettime() returns code 0 (OK)
>>>>>>>>>>>   time dbd66a6a.59bca948  Wed, Nov 16 2016  5:17:30.350,
>>>>>>>>>>>   maximum error 197431 us, estimated error 71015
us, TAI offset 0
>>>>>>>>>>> ntp_adjtime() returns code 0 (OK)
>>>>>>>>>>>   modes 0x0 (),
>>>>>>>>>>>   offset 74989.459 us, frequency 19.950 ppm,
interval 1 s,
>>>>>>>>>>>   maximum error 197431 us, estimated error 71015
>>>>>>>>>>>   status 0x2001 (PLL,NANO),
>>>>>>>>>>>   time constant 6, precision 0.001 us, tolerance
500 ppm,
>>>>>>>>>>> So it looks like this error is anticipated, but
the expected
>>>>>>>>>>> conditions for it to occur are absent. Any ideas
what could be going
>>>>>>>>>>> on here? This is with a recent checkout of Impala

View raw message