hadoop-mapreduce-user mailing list archives

From "David chen" <c77...@163.com>
Subject Stopping ntpd signals SIGTERM, then causes namenode exit
Date Mon, 09 Feb 2015 03:30:08 GMT
A shell script is deployed on every node of the HDFS cluster. It is invoked hourly by crontab,
and its content is as follows:
#!/bin/bash
service ntpd stop
ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
service ntpd start
chkconfig ntpd on


After several days, the NameNode crashed suddenly. Its log showed no errors other than the
following:
2015-01-07 14:00:00,709 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM


I inspected the Linux system log (CentOS, /var/log/messages) and found the following clues:
Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22 11:23:27 UTC 2013 (1)
Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, fe80::ca1f:66ff:fee1:eed#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 192.168.1.151#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for interface updates
Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from /var/lib/ntp/drift
Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, exiting
Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...  shutting down...
Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing attributes.
Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable


It looks likely that the NameNode received the SIGTERM sent when ntpd was stopped.
So far the problem has occurred three times, at Jan  7 14:00:00, Jan 14 14:00:00
and Feb  4 14:00:00.
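Since the three occurrences are spaced in whole weeks, it may be worth checking which weekday they fall on. A minimal sketch, assuming the year is 2015 for all three dates (only the first is confirmed by the NameNode log line) and that GNU date is available; weekday_of is a hypothetical helper name:

```shell
#!/bin/bash
# Print the ISO weekday number (1 = Monday ... 7 = Sunday) for a date.
# Relies on GNU date's -d option; the year 2015 is an assumption for
# the second and third dates.
weekday_of() {
    date -u -d "$1" +%u
}

# The three crash dates reported above.
for day in 2015-01-07 2015-01-14 2015-02-04; do
    echo "$day $(weekday_of "$day")"
done
```

All three dates print 3, i.e. Wednesday, so whatever differs at 14:00 may be tied to something that runs weekly rather than hourly.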
I know the time-synchronization script is not quite proper, and I know the correct way to
synchronize. But I wonder why the NameNode receives the SIGTERM sent by the command that
stops ntpd, and why all three occurrences happened at 14:00:00.
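One way to gather more evidence would be to record, just before `service ntpd stop` runs, which process the ntpd PID file points to. The following is only a sketch: the PID file path /var/run/ntpd.pid is an assumption (it is not shown in this message), and describe_pidfile_owner is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: report which process a PID file currently points to.
# Prints "<pid> <command line>", or a note if the file or process is missing.
describe_pidfile_owner() {
    local pidfile="$1" pid
    if [ ! -r "$pidfile" ]; then
        echo "no readable pid file: $pidfile"
        return 1
    fi
    pid="$(tr -d '[:space:]' < "$pidfile")"
    if [ -z "$pid" ] || [ ! -r "/proc/$pid/cmdline" ]; then
        echo "pid '$pid' from $pidfile is not running"
        return 1
    fi
    # /proc/<pid>/cmdline separates arguments with NUL bytes.
    echo "$pid $(tr '\0' ' ' < "/proc/$pid/cmdline")"
}

# Assumed PID file location; adjust to match the local ntpd configuration.
describe_pidfile_owner "${NTPD_PIDFILE:-/var/run/ntpd.pid}" || true
```

Called from the hourly script before the stop command, with its output redirected to a log file, this would show whether the recorded PID ever belongs to something other than ntpd.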
Any ideas would be appreciated.