From: Or Sher
To: user@hadoop.apache.org
Date: Wed, 11 Feb 2015 09:36:45 +0200
Subject: Re: Re: Stopping ntpd signals SIGTERM, then causes namenode exit
In-Reply-To: <14dfea4e.18e1.14b764a95f9.Coremail.c77_cn@163.com>

I'm not sure it's related, but I encountered a similar issue a few months ago.
In my case, an "at" command was sending a kill signal to the at daemon, using its correct PID.
Somehow, once in a while, that signal reached the Cassandra process (a Java process as well) and killed it.
After some investigation I assumed this had to be a kernel bug or something similar, so I opened a ticket for CentOS - http://bugs.centos.org/view.php?id=7539 which nobody is really looking at :)
You can read there how I tried to tackle it.
Bottom line: we changed the at scheduler to a different implementation and we don't get this issue any more.

HTH,
Or.

On Wed, Feb 11, 2015 at 3:39 AM, David chen <c77_cn@163.com> wrote:

> The command 'service ntpd stop' could be triggered around 14:00,
> because the crontab was set as follows:
> 0 * * * * sh sync.sh
> The script contains the following commands:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
>
> Found the following fragment in /var/log/message:
> Jan 7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan 7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22 11:23:27 UTC 2013 (1)
> Jan 7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 Disabled
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 192.168.1.151#123 Enabled
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for interface updates
> Jan 7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan 7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from /var/lib/ntp/drift
> Jan 7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, exiting
> Jan 7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan 7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan 7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal... shutting down...
> Jan 7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing attributes.
> Jan 7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan 7 14:52:48 host1 ntpd[44765]: no servers reachable
>
> It looks likely that the command 'service ntpd stop' sent the SIGTERM signal.
> The above clue 'ntpd[32101]' indicates that the ntpd process PID is 32101.
> Inspecting the NameNode log, I found that the NameNode process PID was not
> identical to that of ntpd.
> So I wonder why the NameNode process could receive the signal?
>

-- 
Or Sher