hadoop-hdfs-user mailing list archives

From Han-Cheol Cho <hancheol....@nhn-playart.com>
Subject Re: A problem with Hadoop PID files
Date Mon, 27 Oct 2014 10:22:06 GMT
Hi Vikas,
Thank you for your reply.
I understand that the method you suggested solves the problem, and I actually
did it a few times while digging into the cause.
But I don't want to fix it manually, since it can happen even while I'm asleep
at 4:00 AM :-)
As a working solution, I am currently using the following start and stop commands. 
check process hadoop-hdfs-namenode with pidfile /var/run/hadoop-hdfs/hadoop-hdfs-namenode.pid
  start program = "/bin/bash -c 'pkill -15 -f org\.apache\.hadoop\.hdfs\.server\.namenode\.NameNode; service hadoop-hdfs-namenode start'"
  stop  program = "/bin/bash -c 'pkill -15 -f org\.apache\.hadoop\.hdfs\.server\.namenode\.NameNode'"
It still uses the pidfile to check the status of the daemon, but stops the daemon with
"pkill" instead of "service ... stop".
Therefore, Monit can stop the daemon that is actually running (not the one named in the pidfile)
even if the pidfile contains the wrong PID.
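
For anyone who wants to detect the drift before falling back to pkill, here is a minimal sketch of a stale-pidfile check. The function name pidfile_is_stale is hypothetical, not something Hadoop or Monit provides, and the check only tells you whether the recorded PID is alive, not whether it belongs to the NameNode.

```shell
#!/bin/bash
# pidfile_is_stale is a hypothetical helper, not part of Hadoop or Monit.
# Returns 0 (stale) when the pidfile is missing, empty, non-numeric, or names
# a PID that is not running; returns 1 when the recorded PID is alive.
pidfile_is_stale() {
    local pidfile="$1" pid
    [ -r "$pidfile" ] || return 0
    # read fails on a file without a trailing newline but still sets pid.
    read -r pid < "$pidfile" || true
    case "$pid" in
        ''|*[!0-9]*) return 0 ;;
    esac
    # kill -0 sends no signal; it only reports whether the PID can be signalled.
    # Run this as root (as Monit normally does) so that permissions do not
    # make a live process owned by another user look dead.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

It could be called as: pidfile_is_stale /var/run/hadoop-hdfs/hadoop-hdfs-namenode.pid && echo "pidfile out of sync"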

I am not sure how many people on this mailing list are interested in this subject,
but I hope this is helpful for someone.
Best wishes,
-----Original Message-----
From: "vikas srivastava" <vikas_srivastava@apple.com>
To: "Han-Cheol Cho" <hancheol.cho@nhn-playart.com>;
Sent: 2014-10-27 (Mon) 19:00:11
Subject: Re: A problem with Hadoop PID files

Hey, just delete the file, or put the correct PID inside hdfs-namenode.pid. Thanks.

On Oct 21, 2014, at 9:47 PM, Han-Cheol Cho <hancheol.cho@nhn-playart.com> wrote:
Hi all,
  I am using Monit to monitor Hadoop processes and automatically restart them when they fail.
  From time to time, however, a Hadoop process (e.g., the NameNode) runs with one PID, say
1111, while its pid file (/var/run/hadoop-hdfs/hadoop-hdfs-namenode.pid) contains a different
value, say 1222.
Monit assumes that the service is not running and tries to start it again using the specified command
"/sbin/service hadoop-hdfs-namenode start".
The problem is that the NameNode is already running (with a different PID from the one in the pid file).
Therefore, the service command fails, but it still rewrites the pid file, so the number in this
file just keeps growing...
  Probably, after Monit finds that the NameNode is not running, it launches the NameNode several
times in quick succession; as a result, the first instance comes up, but the second one overwrites the pid file.
The launch script also does not seem to have any locking to protect the pid file.
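
To illustrate the kind of locking that seems to be missing, here is a rough sketch of a wrapper that lets only one start attempt run at a time. The name with_start_lock and the lock directory are hypothetical; nothing like this ships with the Hadoop init scripts as far as this thread shows. It relies on mkdir being atomic.

```shell
#!/bin/bash
# with_start_lock is a hypothetical wrapper, not part of the Hadoop init scripts.
# Usage: with_start_lock LOCKDIR COMMAND [ARGS...]
with_start_lock() {
    local lockdir="$1"
    shift
    # mkdir either creates the directory or fails, atomically, so only one
    # concurrent caller gets past this point.
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "start already in progress, skipping" >&2
        return 1
    fi
    # Run the wrapped command, then release the lock and pass on its exit status.
    "$@"
    local status=$?
    rmdir "$lockdir"
    return "$status"
}
```

With an assumed lock path, it could be used as: with_start_lock /var/run/hadoop-hdfs/namenode-start.lock service hadoop-hdfs-namenode start. One caveat: if the wrapper is killed between mkdir and rmdir, the lock directory stays behind and has to be removed by hand; a flock(1)-based lock would avoid that, at the cost of depending on util-linux.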
  Has anyone experienced a similar problem?
For now, I am stopping the process manually (kill -15 pid) as a workaround, since "service ...
stop" does not work either.
  Best wishes,

 CHO, Han-Cheol, Ph.D. (趙漢哲)
Data Science Lab. / Data Scientist
TEL: 03-5155-1160 (department main line)   FAX: 03-5155-3307
Shibuya Hikarie 27F, 2-21-1 Shibuya, Shibuya-ku, Tokyo 150-8510
Email  hancheol.cho@nhn-playart.com

