impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Todd Lipcon <t...@cloudera.com>
Subject Re: Kudu start error with low ntpdate "maximum error"
Date Mon, 21 Nov 2016 15:19:28 GMT
On Sat, Nov 19, 2016 at 10:50 AM, Jim Apple <jbapple@cloudera.com> wrote:

> Can you do so?
>
> This job used ntpwait before starting and has syslog sent to the
> output; perhaps that will help debug:
>
> http://ec2-35-161-220-160.us-west-2.compute.amazonaws.com:80
> 80/job/ubuntu-14.04-from-scratch/147/
>
>
Yea, looks from the last ntptime output that it lost its NTP
synchronization. Looking at the startup log it appears that it's just
configured with the default Ubuntu NTP servers rather than using multiple
hosts inside EC2. Is it possible for you to adjust your AMI or change the
startup script to use the following lines in /etc/ntp.conf?

server 0.amazon.pool.ntp.org
server 1.amazon.pool.ntp.org
server 2.amazon.pool.ntp.org
server 3.amazon.pool.ntp.org

We could add a flag to ignore NTP sync, but it looks like once NTP goes out
of sync it's also reporting a maxerror of 16 seconds, which would also
cause an issue.

-Todd


> In the meantime, it seems we could use a workaround in order to get
> this working in ec2 more consistently.
>
> On Thu, Nov 17, 2016 at 12:21 PM, Todd Lipcon <todd@cloudera.com> wrote:
> > Nothing else comes to mind. Let me know if you think the solution of
> > allowing Kudu to ignore NTP status is preferable and I'll prioritize a
> > patch to unblock you guys.
> >
> > -Todd
> >
> > On Thu, Nov 17, 2016 at 11:43 AM, Jim Apple <jbapple@cloudera.com>
> wrote:
> >
> >> This is a normal functional test. The node is gone now, but I will add
> >> "cat /var/log/syslog" to the logging. Anything else I should add?
> >>
> >> On Thu, Nov 17, 2016 at 10:55 AM, Todd Lipcon <todd@cloudera.com>
> wrote:
> >> > On Thu, Nov 17, 2016 at 10:52 AM, Jim Apple <jbapple@cloudera.com>
> >> wrote:
> >> >
> >> >> The kudu master failure was at 16:36:28 UTC, but at 13:57:55 and
> >> >> 15:58:41 we have ntptime printing "status 0x2001 (PLL,NANO),".
> >> >>
> >> >
> >> > Do you see anything from ntp in /var/log/messages or syslog, etc?
> Usually
> >> > once NTP is synchronized it's fairly hard to knock it out of sync
> unless
> >> it
> >> > loses internet access. Is this a stress test or normal functional
> tests?
> >> >
> >> >
> >> >>
> >> >> On Thu, Nov 17, 2016 at 10:14 AM, Todd Lipcon <todd@cloudera.com>
> >> wrote:
> >> >> > On Thu, Nov 17, 2016 at 10:00 AM, Jim Apple <jbapple@cloudera.com>
> >> >> wrote:
> >> >> >
> >> >> >> I now have a Kudu master failing with "F1117 16:36:26.940562
> 113049
> >> >> >> hybrid_clock.cc:227] Couldn't get the current time: Clock
> >> >> >> unsynchronized. Status: Service unavailable: Error reading
clock.
> >> >> >> Clock considered unsynchronized"
> >> >> >>
> >> >> >> Todd, when you say "we need to wait for sync before running
tests
> or
> >> >> >> allow an unsafe flag to disable the check," do you mean "we"
> Impala,
> >> >> >> "we" Kudu, or both? If Kudu or both, is there a Kudu bug I
should
> >> >> >> follow? I don't see a likely candidate at
> >> >> >>
> >> >> >> https://issues.apache.org/jira/browse/KUDU-1202?jql=
> >> >> >> project%20%3D%20KUDU%20AND%20text%20~%20%22%5C%22Clock%
> >> >> >> 20considered%20unsynchronized%5C%22%22
> >> >> >
> >> >> >
> >> >> > Sorry, I was on my phone so wasn't as precise as I should have
> been. I
> >> >> > think that either:
> >> >> >
> >> >> > (a) Impala could add a script which runs prior to tests which
waits
> >> for
> >> >> > synchronization. Apparently there is a builtin "ntp-wait" command
> in
> >> some
> >> >> > distros. We use the following which also includes restarting NTP
a
> few
> >> >> > times which seems to speed things up:
> >> >> > https://gist.github.com/toddlipcon/97f7c8a4cf1d9c2551bd4289b
> 97dfe48
> >> >> >
> >> >> > (b) Kudu could add a flag like --ignore_bad_ntp_sync_status which
> you
> >> >> could
> >> >> > use in minicluster tests. In a minicluster where all the daemons
> run
> >> on
> >> >> one
> >> >> > node, it's sort of guaranteed that the clocks are in sync, so
maybe
> >> it's
> >> >> > not too bad of an idea. The only risk I can see is that the clock
> >> might
> >> >> be
> >> >> > adjusted backwards if NTP is not in sync status yet, and we may
> have
> >> some
> >> >> > assertions which would be unhappy if that happened. We could give
> it a
> >> >> try,
> >> >> > though, if the "wait-for-ntp" approach doesn't help.
> >> >> >
> >> >> > -Todd
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> On Wed, Nov 16, 2016 at 2:19 PM, Todd Lipcon <todd@cloudera.com>
> >> wrote:
> >> >> >> > This particular error message isn't related to the clock
error
> >> itself.
> >> >> >> It's
> >> >> >> > the sync status flag which indicates whether NTP thinks
its
> phase
> >> lock
> >> >> >> loop
> >> >> >> > is in sync or not. So, changing the allowed error shouldn't
have
> >> any
> >> >> >> > effect. Instead we need to wait for sync before running
tests or
> >> >> allow an
> >> >> >> > unsafe flag to disable the check.
> >> >> >> >
> >> >> >> > On Nov 16, 2016 1:55 PM, "Matthew Jacobs" <mj@cloudera.com>
> wrote:
> >> >> >> >
> >> >> >> >> Yeah, I don't doubt that the ntp output when you
checked it was
> >> >> wrong,
> >> >> >> >> but it may be different from the time when Kudu failed
to
> start,
> >> >> since
> >> >> >> >> that was probably at least several minutes later
(assuming you
> >> had to
> >> >> >> >> find out about the failed job, ssh into the machine,
discover
> the
> >> ntp
> >> >> >> >> warning and then check it at that time).
> >> >> >> >>
> >> >> >> >> I agree adding some logging at the beginning of jobs
is the
> best
> >> >> thing
> >> >> >> >> to do now.
> >> >> >> >>
> >> >> >> >> Thanks!
> >> >> >> >>
> >> >> >> >> On Wed, Nov 16, 2016 at 11:38 AM, Jim Apple <
> jbapple@cloudera.com
> >> >
> >> >> >> wrote:
> >> >> >> >> > I found a place where I suppose that could go
> >> >> >> >> > (testdata/cluster/node_templates/cdh5/etc/kudu/),
but I'm
> not
> >> sure
> >> >> >> >> > what I want to set it at. If "maximum error
197431 us" is to
> be
> >> >> >> >> > believed, it needs to be at least 0.2 seconds,
and if
> >> >> >> >> > https://kudu.apache.org/docs/troubleshooting.html
is to be
> >> >> believed,
> >> >> >> >> > it is already 10 seconds.
> >> >> >> >> >
> >> >> >> >> > I added some more logging to try and see what
is going on,
> and
> >> if I
> >> >> >> >> > hit this again I'll re-open the thread. For
now, I am not
> >> planning
> >> >> to
> >> >> >> >> > submit a patch to "fix" it because I'm not sure
increasing
> the
> >> >> number
> >> >> >> >> > is the real solution.
> >> >> >> >> >
> >> >> >> >> > On Wed, Nov 16, 2016 at 11:00 AM, Matthew Jacobs
<
> >> mj@cloudera.com>
> >> >> >> >> wrote:
> >> >> >> >> >> According to the error message, it looks
like we can specify
> >> the
> >> >> >> >> >> '--max_clock_sync_error_usec' flag when
starting the Kudu
> >> >> processes.
> >> >> >> >> >> We may want to start by printing ntptime
output at the
> >> beginning
> >> >> of
> >> >> >> >> >> jobs so we can see how far off it is. If
it's off by days
> then
> >> >> maybe
> >> >> >> >> >> changing the error isn't a good idea, and
we'll need to
> figure
> >> out
> >> >> >> >> >> something else.
> >> >> >> >> >>
> >> >> >> >> >> On Wed, Nov 16, 2016 at 10:55 AM, Jim Apple
<
> >> jbapple@cloudera.com
> >> >> >
> >> >> >> >> wrote:
> >> >> >> >> >>> How do we bump up the allowable error?
> >> >> >> >> >>>
> >> >> >> >> >>> On Wed, Nov 16, 2016 at 10:20 AM, Matthew
Jacobs <
> >> >> mj@cloudera.com>
> >> >> >> >> wrote:
> >> >> >> >> >>>> I asked on the Kudu slack channel,
they have seen issues
> >> where
> >> >> >> freshly
> >> >> >> >> >>>> provisioned ec2 nodes take some
time for ntp to quiesce,
> but
> >> >> they
> >> >> >> >> >>>> didn't have a sense of how long
that might take. If you
> >> checked
> >> >> >> >> >>>> ntptime after the job failed, it
may be that ntp had
> enough
> >> >> time.
> >> >> >> We
> >> >> >> >> >>>> can probably consider bumping up
the allowable error.
> >> >> >> >> >>>>
> >> >> >> >> >>>> On Wed, Nov 16, 2016 at 9:24 AM,
Jim Apple <
> >> >> jbapple@cloudera.com>
> >> >> >> >> wrote:
> >> >> >> >> >>>>> This is the second time I have
seen it, but it doesn't
> >> happen
> >> >> >> every
> >> >> >> >> >>>>> time. It could very well be
a difference on ec2; already
> >> I've
> >> >> seen
> >> >> >> >> >>>>> some bugs due to my ec2 instances
being Etc/UTC timezone
> >> while
> >> >> >> most
> >> >> >> >> >>>>> Impala developers work in America/Los_Angeles.
> >> >> >> >> >>>>>
> >> >> >> >> >>>>> On Wed, Nov 16, 2016 at 9:10
AM, Matthew Jacobs <
> >> >> mj@cloudera.com>
> >> >> >> >> wrote:
> >> >> >> >> >>>>>> No problem. If this happens
again we should ask the Kudu
> >> >> >> >> developers. I
> >> >> >> >> >>>>>> haven't seen this before
- I wonder if it could be some
> >> >> >> weirdness on
> >> >> >> >> >>>>>> ec2...
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> Thanks
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> On Wed, Nov 16, 2016 at
9:01 AM, Jim Apple <
> >> >> jbapple@cloudera.com
> >> >> >> >
> >> >> >> >> wrote:
> >> >> >> >> >>>>>>> Thank you for your help!
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> This was on an AWS machine
that has expired, but I can
> see
> >> >> from
> >> >> >> the
> >> >> >> >> >>>>>>> logs that "IMPALA_KUDU_VERSION=88b023"
and
> >> >> >> >> >>>>>>> "KUDU_JAVA_VERSION=1.0.0-SNAPSHOT"
and "Downloading
> >> >> >> >> >>>>>>> kudu-python-0.3.0.tar.gz"
and "URL
> >> >> >> >> >>>>>>> https://native-toolchain.s3.amazonaws.com/build/264-
> >> >> >> >> e9d44349ba/kudu/88b023-gcc-4.9.2/kudu-88b023-gcc-4.9.2-ec2-
> >> >> >> >> package-ubuntu-14-04.tar.gz".
> >> >> >> >> >>>>>>> I'll add "ps aux | grep
kudu" to the logging this
> machine
> >> >> does
> >> >> >> on
> >> >> >> >> >>>>>>> error, so we'll have
it next time, but I did "ps
> -Afly" on
> >> >> exit
> >> >> >> and
> >> >> >> >> >>>>>>> there were no kudu processes
running, it looks like.
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> On Wed, Nov 16, 2016
at 8:52 AM, Matthew Jacobs <
> >> >> >> mj@cloudera.com>
> >> >> >> >> wrote:
> >> >> >> >> >>>>>>>> Can you check which
version of the client you're
> building
> >> >> >> against
> >> >> >> >> >>>>>>>> (KUDU_VERSION env
var) vs what Kudu version is running
> >> (ps
> >> >> aux
> >> >> >> |
> >> >> >> >> grep
> >> >> >> >> >>>>>>>> kudu
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>> On Wed, Nov 16,
2016 at 8:48 AM, Jim Apple <
> >> >> >> jbapple@cloudera.com>
> >> >> >> >> wrote:
> >> >> >> >> >>>>>>>>> Yes.
> >> >> >> >> >>>>>>>>>
> >> >> >> >> >>>>>>>>> On Wed, Nov
16, 2016 at 7:45 AM, Matthew Jacobs <
> >> >> >> mj@cloudera.com>
> >> >> >> >> wrote:
> >> >> >> >> >>>>>>>>>> Do you have
NTP installed?
> >> >> >> >> >>>>>>>>>>
> >> >> >> >> >>>>>>>>>> On Tue,
Nov 15, 2016 at 9:22 PM, Jim Apple <
> >> >> >> >> jbapple@cloudera.com> wrote:
> >> >> >> >> >>>>>>>>>>> I have
a machine where Kudu failed to start:
> >> >> >> >> >>>>>>>>>>>
> >> >> >> >> >>>>>>>>>>> F1116
05:02:00.173629 71098
> tablet_server_main.cc:64]
> >> >> Check
> >> >> >> >> failed:
> >> >> >> >> >>>>>>>>>>> _s.ok()
Bad status: Service unavailable: Cannot
> >> >> initialize
> >> >> >> >> clock:
> >> >> >> >> >>>>>>>>>>> Error
reading clock. Clock considered
> unsynchronized
> >> >> >> >> >>>>>>>>>>>
> >> >> >> >> >>>>>>>>>>> https://kudu.apache.org/docs/troubleshooting.html
> >> says:
> >> >> >> >> >>>>>>>>>>>
> >> >> >> >> >>>>>>>>>>> "For
the master and tablet server daemons, the
> >> server’s
> >> >> >> clock
> >> >> >> >> must be
> >> >> >> >> >>>>>>>>>>> synchronized
using NTP. In addition, the maximum
> clock
> >> >> error
> >> >> >> >> (not to
> >> >> >> >> >>>>>>>>>>> be mistaken
with the estimated error) be below a
> >> >> >> configurable
> >> >> >> >> >>>>>>>>>>> threshold.
The default value is 10 seconds, but it
> >> can be
> >> >> >> set
> >> >> >> >> with the
> >> >> >> >> >>>>>>>>>>> flag
--max_clock_sync_error_usec."
> >> >> >> >> >>>>>>>>>>>
> >> >> >> >> >>>>>>>>>>> and
> >> >> >> >> >>>>>>>>>>>
> >> >> >> >> >>>>>>>>>>> "If
NTP is installed the user can monitor the
> >> >> >> synchronization
> >> >> >> >> status
> >> >> >> >> >>>>>>>>>>> by running
ntptime. The relevant value is what is
> >> >> reported
> >> >> >> for
> >> >> >> >> maximum
> >> >> >> >> >>>>>>>>>>> error."
> >> >> >> >> >>>>>>>>>>>
> >> >> >> >> >>>>>>>>>>> ntptime
reports:
> >> >> >> >> >>>>>>>>>>>
> >> >> >> >> >>>>>>>>>>> ntp_gettime()
returns code 0 (OK)
> >> >> >> >> >>>>>>>>>>>   time
dbd66a6a.59bca948  Wed, Nov 16 2016
> >> 5:17:30.350,
> >> >> >> >> (.350535824),
> >> >> >> >> >>>>>>>>>>>   maximum
error 197431 us, estimated error 71015
> us,
> >> TAI
> >> >> >> >> offset 0
> >> >> >> >> >>>>>>>>>>> ntp_adjtime()
returns code 0 (OK)
> >> >> >> >> >>>>>>>>>>>   modes
0x0 (),
> >> >> >> >> >>>>>>>>>>>   offset
74989.459 us, frequency 19.950 ppm,
> interval
> >> 1
> >> >> s,
> >> >> >> >> >>>>>>>>>>>   maximum
error 197431 us, estimated error 71015
> us,
> >> >> >> >> >>>>>>>>>>>   status
0x2001 (PLL,NANO),
> >> >> >> >> >>>>>>>>>>>   time
constant 6, precision 0.001 us, tolerance
> 500
> >> ppm,
> >> >> >> >> >>>>>>>>>>>
> >> >> >> >> >>>>>>>>>>> So it
looks like this error is anticipated, but the
> >> >> expected
> >> >> >> >> >>>>>>>>>>> conditions
for it to occur are absent. Any ideas
> what
> >> >> could
> >> >> >> be
> >> >> >> >> going
> >> >> >> >> >>>>>>>>>>> on here?
This is with a recent checkout of Impala
> >> master.
> >> >> >> >>
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Todd Lipcon
> >> >> > Software Engineer, Cloudera
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Todd Lipcon
> >> > Software Engineer, Cloudera
> >>
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message