hadoop-common-user mailing list archives

From Alexander Alten-Lorenz <wget.n...@gmail.com>
Subject Re: Stopping ntpd signals SIGTERM, then causes namenode exit
Date Mon, 09 Feb 2015 18:28:33 GMT
I would focus on this line:

Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable

That looks to me like a network or DNS issue. You can also check dmesg to see what is going on.
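
To line the termination events up by timestamp, something like the following might help. It is only a minimal sketch, assuming a syslog-format file such as /var/log/messages. The function name term_events is made up for illustration, and the patterns are just the phrasings that appear in the log excerpt quoted below.

```shell
#!/bin/bash
# Hypothetical helper (not part of any tool): print log lines that mention
# a termination signal, so events from different daemons can be compared
# by timestamp.
# Usage: term_events /var/log/messages
term_events() {
    local logfile="$1"
    # Patterns are assumptions taken from the excerpt in this thread:
    #   "exiting on signal 15", "Received TERM or STOP signal", "SIGTERM"
    grep -E 'signal 15|SIGTERM|TERM or STOP signal' "$logfile"
}
```

Running it against both the system log and the NameNode log should show whether anything else was terminated at the same second as the NameNode.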

BR
- Alexander

> On 09 Feb 2015, at 17:57, daemeon reiydelle <daemeonr@gmail.com> wrote:
> 
> Absolutely a critical error to lose the configured ntpd time source in Hadoop. The replication
> and many other services require absolutely millisecond time sync between the nodes. Interesting
> that your SRE design called for ntpd running on each node. Curious.
> 
> What is the problem you are trying to solve by stopping ntpd on the local host? Did someone
> not understand how ntpd works? Did someone configure it to (I sure hope not) be free running?
> 
> 
> 
> .......
> “Life should not be a journey to the grave with the intention of arriving safely in a
> pretty and well preserved body, but rather to skid in broadside in a cloud of smoke,
> thoroughly used up, totally worn out, and loudly proclaiming “Wow! What a Ride!”
> - Hunter Thompson
> 
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
> 
> On Sun, Feb 8, 2015 at 7:30 PM, David chen <c77_cn@163.com> wrote:
> A shell script is deployed on every node of the HDFS cluster. It is invoked hourly by crontab, and its content is as follows:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
> 
> After several days, the NameNode crashed suddenly, but its log showed no errors other than the following:
> 2015-01-07 14:00:00,709 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
> 
> I inspected the Linux log (CentOS /var/log/messages) and found the following clues:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
> 
> It looks likely that the NameNode received the SIGTERM that was sent when the script stopped ntpd.
> So far the problem has occurred three times: at Jan  7 14:00:00, Jan 14 14:00:00, and Feb  4 14:00:00.
> I know the script is not the proper way to synchronize time, and I know the correct way. But I wonder why the NameNode would receive the SIGTERM sent by stopping ntpd, and why all three occurrences happened at 14:00:00.
> Any ideas would be appreciated.
> 

