impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jim Apple <jbap...@cloudera.com>
Subject Re: Kudu start error with low ntpdate "maximum error"
Date Sat, 19 Nov 2016 18:50:16 GMT
Can you do so?

This job used ntpwait before starting and has syslog sent to the
output; perhaps that will help debug:

http://ec2-35-161-220-160.us-west-2.compute.amazonaws.com:8080/job/ubuntu-14.04-from-scratch/147/

In the meantime, it seems we could use a workaround in order to get
this working in ec2 more consistently.

On Thu, Nov 17, 2016 at 12:21 PM, Todd Lipcon <todd@cloudera.com> wrote:
> Nothing else comes to mind. Let me know if you think the solution of
> allowing Kudu to ignore NTP status is preferable and I'll prioritize a
> patch to unblock you guys.
>
> -Todd
>
> On Thu, Nov 17, 2016 at 11:43 AM, Jim Apple <jbapple@cloudera.com> wrote:
>
>> This is a normal functional test. The node is gone now, but I will add
>> "cat /var/log/syslog" to the logging. Anything else I should add?
>>
>> On Thu, Nov 17, 2016 at 10:55 AM, Todd Lipcon <todd@cloudera.com> wrote:
>> > On Thu, Nov 17, 2016 at 10:52 AM, Jim Apple <jbapple@cloudera.com>
>> wrote:
>> >
>> >> The kudu master failure was at 16:36:28 UTC, but at 13:57:55 and
>> >> 15:58:41 we have ntptime printing "status 0x2001 (PLL,NANO),".
>> >>
>> >
>> > Do you see anything from ntp in /var/log/messages or syslog, etc? Usually
>> > once NTP is synchronized it's fairly hard to knock it out of sync unless
>> it
>> > loses internet access. Is this a stress test or normal functional tests?
>> >
>> >
>> >>
>> >> On Thu, Nov 17, 2016 at 10:14 AM, Todd Lipcon <todd@cloudera.com>
>> wrote:
>> >> > On Thu, Nov 17, 2016 at 10:00 AM, Jim Apple <jbapple@cloudera.com>
>> >> wrote:
>> >> >
>> >> >> I now have a Kudu master failing with "F1117 16:36:26.940562 113049
>> >> >> hybrid_clock.cc:227] Couldn't get the current time: Clock
>> >> >> unsynchronized. Status: Service unavailable: Error reading clock.
>> >> >> Clock considered unsynchronized"
>> >> >>
>> >> >> Todd, when you say "we need to wait for sync before running tests
or
>> >> >> allow an unsafe flag to disable the check," do you mean "we" Impala,
>> >> >> "we" Kudu, or both? If Kudu or both, is there a Kudu bug I should
>> >> >> follow? I don't see a likely candidate at
>> >> >>
>> >> >> https://issues.apache.org/jira/browse/KUDU-1202?jql=
>> >> >> project%20%3D%20KUDU%20AND%20text%20~%20%22%5C%22Clock%
>> >> >> 20considered%20unsynchronized%5C%22%22
>> >> >
>> >> >
>> >> > Sorry, I was on my phone so wasn't as precise as I should have been.
I
>> >> > think that either:
>> >> >
>> >> > (a) Impala could add a script which runs prior to tests which waits
>> for
>> >> > synchronization. Apparently there is a builtin "ntp-wait" command in
>> some
>> >> > distros. We use the following which also includes restarting NTP a
few
>> >> > times which seems to speed things up:
>> >> > https://gist.github.com/toddlipcon/97f7c8a4cf1d9c2551bd4289b97dfe48
>> >> >
>> >> > (b) Kudu could add a flag like --ignore_bad_ntp_sync_status which you
>> >> could
>> >> > use in minicluster tests. In a minicluster where all the daemons run
>> on
>> >> one
>> >> > node, it's sort of guaranteed that the clocks are in sync, so maybe
>> it's
>> >> > not too bad of an idea. The only risk I can see is that the clock
>> might
>> >> be
>> >> > adjusted backwards if NTP is not in sync status yet, and we may have
>> some
>> >> > assertions which would be unhappy if that happened. We could give it
a
>> >> try,
>> >> > though, if the "wait-for-ntp" approach doesn't help.
>> >> >
>> >> > -Todd
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> On Wed, Nov 16, 2016 at 2:19 PM, Todd Lipcon <todd@cloudera.com>
>> wrote:
>> >> >> > This particular error message isn't related to the clock error
>> itself.
>> >> >> It's
>> >> >> > the sync status flag which indicates whether NTP thinks its
phase
>> lock
>> >> >> loop
>> >> >> > is in sync or not. So, changing the allowed error shouldn't
have
>> any
>> >> >> > effect. Instead we need to wait for sync before running tests
or
>> >> allow an
>> >> >> > unsafe flag to disable the check.
>> >> >> >
>> >> >> > On Nov 16, 2016 1:55 PM, "Matthew Jacobs" <mj@cloudera.com>
wrote:
>> >> >> >
>> >> >> >> Yeah, I don't doubt that the ntp output when you checked
it was
>> >> wrong,
>> >> >> >> but it may be different from the time when Kudu failed
to start,
>> >> since
>> >> >> >> that was probably at least several minutes later (assuming
you
>> had to
>> >> >> >> find out about the failed job, ssh into the machine, discover
the
>> ntp
>> >> >> >> warning and then check it at that time).
>> >> >> >>
>> >> >> >> I agree adding some logging at the beginning of jobs is
the best
>> >> thing
>> >> >> >> to do now.
>> >> >> >>
>> >> >> >> Thanks!
>> >> >> >>
>> >> >> >> On Wed, Nov 16, 2016 at 11:38 AM, Jim Apple <jbapple@cloudera.com
>> >
>> >> >> wrote:
>> >> >> >> > I found a place where I suppose that could go
>> >> >> >> > (testdata/cluster/node_templates/cdh5/etc/kudu/),
but I'm not
>> sure
>> >> >> >> > what I want to set it at. If "maximum error 197431
us" is to be
>> >> >> >> > believed, it needs to be at least 0.2 seconds, and
if
>> >> >> >> > https://kudu.apache.org/docs/troubleshooting.html
is to be
>> >> believed,
>> >> >> >> > it is already 10 seconds.
>> >> >> >> >
>> >> >> >> > I added some more logging to try and see what is
going on, and
>> if I
>> >> >> >> > hit this again I'll re-open the thread. For now,
I am not
>> planning
>> >> to
>> >> >> >> > submit a patch to "fix" it because I'm not sure increasing
the
>> >> number
>> >> >> >> > is the real solution.
>> >> >> >> >
>> >> >> >> > On Wed, Nov 16, 2016 at 11:00 AM, Matthew Jacobs
<
>> mj@cloudera.com>
>> >> >> >> wrote:
>> >> >> >> >> According to the error message, it looks like
we can specify
>> the
>> >> >> >> >> '--max_clock_sync_error_usec' flag when starting
the Kudu
>> >> processes.
>> >> >> >> >> We may want to start by printing ntptime output
at the
>> beginning
>> >> of
>> >> >> >> >> jobs so we can see how far off it is. If it's
off by days then
>> >> maybe
>> >> >> >> >> changing the error isn't a good idea, and we'll
need to figure
>> out
>> >> >> >> >> something else.
>> >> >> >> >>
>> >> >> >> >> On Wed, Nov 16, 2016 at 10:55 AM, Jim Apple <
>> jbapple@cloudera.com
>> >> >
>> >> >> >> wrote:
>> >> >> >> >>> How do we bump up the allowable error?
>> >> >> >> >>>
>> >> >> >> >>> On Wed, Nov 16, 2016 at 10:20 AM, Matthew
Jacobs <
>> >> mj@cloudera.com>
>> >> >> >> wrote:
>> >> >> >> >>>> I asked on the Kudu slack channel, they
have seen issues
>> where
>> >> >> freshly
>> >> >> >> >>>> provisioned ec2 nodes take some time
for ntp to quiesce, but
>> >> they
>> >> >> >> >>>> didn't have a sense of how long that
might take. If you
>> checked
>> >> >> >> >>>> ntptime after the job failed, it may
be that ntp had enough
>> >> time.
>> >> >> We
>> >> >> >> >>>> can probably consider bumping up the
allowable error.
>> >> >> >> >>>>
>> >> >> >> >>>> On Wed, Nov 16, 2016 at 9:24 AM, Jim
Apple <
>> >> jbapple@cloudera.com>
>> >> >> >> wrote:
>> >> >> >> >>>>> This is the second time I have seen
it, but it doesn't
>> happen
>> >> >> every
>> >> >> >> >>>>> time. It could very well be a difference
on ec2; already
>> I've
>> >> seen
>> >> >> >> >>>>> some bugs due to my ec2 instances
being Etc/UTC timezone
>> while
>> >> >> most
>> >> >> >> >>>>> Impala developers work in America/Los_Angeles.
>> >> >> >> >>>>>
>> >> >> >> >>>>> On Wed, Nov 16, 2016 at 9:10 AM,
Matthew Jacobs <
>> >> mj@cloudera.com>
>> >> >> >> wrote:
>> >> >> >> >>>>>> No problem. If this happens again
we should ask the Kudu
>> >> >> >> developers. I
>> >> >> >> >>>>>> haven't seen this before - I
wonder if it could be some
>> >> >> weirdness on
>> >> >> >> >>>>>> ec2...
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> Thanks
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> On Wed, Nov 16, 2016 at 9:01
AM, Jim Apple <
>> >> jbapple@cloudera.com
>> >> >> >
>> >> >> >> wrote:
>> >> >> >> >>>>>>> Thank you for your help!
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> This was on an AWS machine
that has expired, but I can see
>> >> from
>> >> >> the
>> >> >> >> >>>>>>> logs that "IMPALA_KUDU_VERSION=88b023"
and
>> >> >> >> >>>>>>> "KUDU_JAVA_VERSION=1.0.0-SNAPSHOT"
and "Downloading
>> >> >> >> >>>>>>> kudu-python-0.3.0.tar.gz"
and "URL
>> >> >> >> >>>>>>> https://native-toolchain.s3.amazonaws.com/build/264-
>> >> >> >> e9d44349ba/kudu/88b023-gcc-4.9.2/kudu-88b023-gcc-4.9.2-ec2-
>> >> >> >> package-ubuntu-14-04.tar.gz".
>> >> >> >> >>>>>>> I'll add "ps aux | grep kudu"
to the logging this machine
>> >> does
>> >> >> on
>> >> >> >> >>>>>>> error, so we'll have it next
time, but I did "ps -Afly" on
>> >> exit
>> >> >> and
>> >> >> >> >>>>>>> there were no kudu processes
running, it looks like.
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> On Wed, Nov 16, 2016 at 8:52
AM, Matthew Jacobs <
>> >> >> mj@cloudera.com>
>> >> >> >> wrote:
>> >> >> >> >>>>>>>> Can you check which version
of the client you're building
>> >> >> against
>> >> >> >> >>>>>>>> (KUDU_VERSION env var)
vs what Kudu version is running
>> (ps
>> >> aux
>> >> >> |
>> >> >> >> grep
>> >> >> >> >>>>>>>> kudu
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>> On Wed, Nov 16, 2016
at 8:48 AM, Jim Apple <
>> >> >> jbapple@cloudera.com>
>> >> >> >> wrote:
>> >> >> >> >>>>>>>>> Yes.
>> >> >> >> >>>>>>>>>
>> >> >> >> >>>>>>>>> On Wed, Nov 16, 2016
at 7:45 AM, Matthew Jacobs <
>> >> >> mj@cloudera.com>
>> >> >> >> wrote:
>> >> >> >> >>>>>>>>>> Do you have NTP
installed?
>> >> >> >> >>>>>>>>>>
>> >> >> >> >>>>>>>>>> On Tue, Nov 15,
2016 at 9:22 PM, Jim Apple <
>> >> >> >> jbapple@cloudera.com> wrote:
>> >> >> >> >>>>>>>>>>> I have a
machine where Kudu failed to start:
>> >> >> >> >>>>>>>>>>>
>> >> >> >> >>>>>>>>>>> F1116 05:02:00.173629
71098 tablet_server_main.cc:64]
>> >> Check
>> >> >> >> failed:
>> >> >> >> >>>>>>>>>>> _s.ok() Bad
status: Service unavailable: Cannot
>> >> initialize
>> >> >> >> clock:
>> >> >> >> >>>>>>>>>>> Error reading
clock. Clock considered unsynchronized
>> >> >> >> >>>>>>>>>>>
>> >> >> >> >>>>>>>>>>> https://kudu.apache.org/docs/troubleshooting.html
>> says:
>> >> >> >> >>>>>>>>>>>
>> >> >> >> >>>>>>>>>>> "For the
master and tablet server daemons, the
>> server’s
>> >> >> clock
>> >> >> >> must be
>> >> >> >> >>>>>>>>>>> synchronized
using NTP. In addition, the maximum clock
>> >> error
>> >> >> >> (not to
>> >> >> >> >>>>>>>>>>> be mistaken
with the estimated error) be below a
>> >> >> configurable
>> >> >> >> >>>>>>>>>>> threshold.
The default value is 10 seconds, but it
>> can be
>> >> >> set
>> >> >> >> with the
>> >> >> >> >>>>>>>>>>> flag --max_clock_sync_error_usec."
>> >> >> >> >>>>>>>>>>>
>> >> >> >> >>>>>>>>>>> and
>> >> >> >> >>>>>>>>>>>
>> >> >> >> >>>>>>>>>>> "If NTP is
installed the user can monitor the
>> >> >> synchronization
>> >> >> >> status
>> >> >> >> >>>>>>>>>>> by running
ntptime. The relevant value is what is
>> >> reported
>> >> >> for
>> >> >> >> maximum
>> >> >> >> >>>>>>>>>>> error."
>> >> >> >> >>>>>>>>>>>
>> >> >> >> >>>>>>>>>>> ntptime reports:
>> >> >> >> >>>>>>>>>>>
>> >> >> >> >>>>>>>>>>> ntp_gettime()
returns code 0 (OK)
>> >> >> >> >>>>>>>>>>>   time dbd66a6a.59bca948
 Wed, Nov 16 2016
>> 5:17:30.350,
>> >> >> >> (.350535824),
>> >> >> >> >>>>>>>>>>>   maximum
error 197431 us, estimated error 71015 us,
>> TAI
>> >> >> >> offset 0
>> >> >> >> >>>>>>>>>>> ntp_adjtime()
returns code 0 (OK)
>> >> >> >> >>>>>>>>>>>   modes 0x0
(),
>> >> >> >> >>>>>>>>>>>   offset
74989.459 us, frequency 19.950 ppm, interval
>> 1
>> >> s,
>> >> >> >> >>>>>>>>>>>   maximum
error 197431 us, estimated error 71015 us,
>> >> >> >> >>>>>>>>>>>   status
0x2001 (PLL,NANO),
>> >> >> >> >>>>>>>>>>>   time constant
6, precision 0.001 us, tolerance 500
>> ppm,
>> >> >> >> >>>>>>>>>>>
>> >> >> >> >>>>>>>>>>> So it looks
like this error is anticipated, but the
>> >> expected
>> >> >> >> >>>>>>>>>>> conditions
for it to occur are absent. Any ideas what
>> >> could
>> >> >> be
>> >> >> >> going
>> >> >> >> >>>>>>>>>>> on here?
This is with a recent checkout of Impala
>> master.
>> >> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Todd Lipcon
>> >> > Software Engineer, Cloudera
>> >>
>> >
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

Mime
View raw message