Message-ID: <130767386.1245165667637.JavaMail.jira@brutus>
Date: Tue, 16 Jun 2009 08:21:07 -0700 (PDT)
From: "Hong Tang (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Issue Comment Edited: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status
In-Reply-To: <138607381.1236871130868.JavaMail.jira@brutus>

    [ 
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720185#action_12720185 ]

Hong Tang edited comment on HADOOP-5478 at 6/16/09 8:20 AM:
------------------------------------------------------------

bq. Hong, is this to check if the TT is alive? In which case, did you mean another signal, like -0 or kill -3. -9 is SIGKILL and would kill the TT. Also, in that case are you suggesting that we could keep the health checker around and continue trying to report after a while?

@hemanth, sorry for not being clear. I gave the problem some more thought, and I think the following logic may be simpler and more robust (1 and 2 are the current logic; 3 is my suggestion):

(1) periodically launch the health checking script;
(2) report the status back to the TT (both good and bad);
(3) if it fails to receive a response from the TT, wait for X seconds, do an extra kill (to ensure the TT is dead), and quit.

I scanned through the code, and it seems that NodeHealthChecker.stop() would be a good place to perform step (3).
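The three steps above can be sketched roughly as follows. This is only an illustration of the proposed control flow, not code from any of the attached patches: the class HealthReporterSketch, the Tracker and HealthScript interfaces, and the maxRounds bound (added so that the example terminates) are all hypothetical stand-ins, and the real "extra kill" would be a signal sent to the TT process rather than a method call.

```java
/** Hypothetical sketch of the proposed reporter loop; no name here comes from the patch. */
final class HealthReporterSketch {

    /** Stand-in for the TaskTracker (TT) side of the exchange. */
    interface Tracker {
        /** Returns true if the TT acknowledged the status report. */
        boolean report(boolean healthy);

        /** Stand-in for the "extra kill" that ensures the TT is dead. */
        void kill();
    }

    /** Stand-in for launching the health checking script once. */
    interface HealthScript {
        boolean run();
    }

    /**
     * Runs steps (1)-(3) for at most maxRounds rounds and returns the number
     * of reports the TT acknowledged before the loop ended.
     */
    static int runLoop(HealthScript script, Tracker tt,
                       long intervalMs, long graceMs, int maxRounds) {
        int acknowledged = 0;
        for (int round = 0; round < maxRounds; round++) {
            boolean healthy = script.run();      // step (1): launch the script
            if (!tt.report(healthy)) {           // step (2): report good or bad status
                sleepQuietly(graceMs);           // step (3): wait X seconds,
                tt.kill();                       //           do the extra kill,
                return acknowledged;             //           and quit
            }
            acknowledged++;
            sleepQuietly(intervalMs);
        }
        return acknowledged;
    }

    /** Sleeps, restoring the interrupt flag instead of throwing. */
    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Under these assumptions, a TT that stops acknowledging reports causes exactly one kill after the grace period, and the loop then exits instead of retrying.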
> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Sreekanth Ramakrishnan
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, hadoop-5478-3.patch, hadoop-5478-4.patch, hadoop-5478-5.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It should run the health check script periodically and, if there are any errors, blacklist the node. This will be really helpful when we run static mapred clusters. Otherwise we may have to run scripts or daemons periodically to find the node status and take the node offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.