From: "Hemanth Yamijala (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Date: Wed, 13 May 2009 03:41:45 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status
Message-ID: <1532221164.1242211305685.JavaMail.jira@brutus>
In-Reply-To: <138607381.1236871130868.JavaMail.jira@brutus>

[
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708879#action_12708879 ]

Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------

bq. this fits in well with the ping/liveness stuff I've been doing

This, coupled with your comment on ping() returning the latest results, seems to indicate that we have a thread that periodically executes the check and stores the results. In that sense, maybe we could build this solution now, and when HADOOP-3628 is committed to trunk, we could integrate it so that the results are returned as part of ping(). Does that make sense?

bq. It may be handy to have this stuff independent of the TT itself, so you can run a node-health checker on anything, and even if the TT refuses to play, you could do some checking of the node.

The health check script itself is definitely external and could be anything. All the TT would provide is the ability to run it periodically. So I can imagine this being run standalone, or integrated with another daemon that provides a similar interface.

bq. Also, could it be a bit of JavaScript instead of a shell script?

Umm. Can we execute this from the TT directly? AFAIK, this is not possible, right? As of now, there is no plan to support anything other than a shell script.

bq. A scenario to worry about is what if something bad happens (e.g. a bit of NFS goes away) that causes all health checks in a big cluster to fail simultaneously. Would this overload the JT?

Since the plan is to send the information in the heartbeats themselves, handling the load of requests should not be a problem. I am not sure how costly blacklist processing itself is on the JT, but hopefully it is not bad. We'll keep this in mind though.
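
To make the "thread that periodically executes and stores the results" idea concrete, here is a rough sketch. It is written in Python for brevity rather than the TT's Java, and it is not the proposed patch. The class name NodeHealthChecker, its methods, the default interval and timeout, and the convention that a non-zero exit status means "unhealthy" are all assumptions made for illustration; none of them has been decided on this issue.

```python
import subprocess
import threading
import time


class NodeHealthChecker:
    """Hypothetical sketch: run an external health check script on a timer
    and remember only the most recent result."""

    def __init__(self, script_argv, interval_secs=60.0, timeout_secs=30.0):
        self._argv = list(script_argv)      # command line of the external script
        self._interval = interval_secs      # assumed default, not from the issue
        self._timeout = timeout_secs        # assumed default, not from the issue
        self._lock = threading.Lock()
        self._latest = {"healthy": True, "report": "", "checked_at": None}
        self._stop = threading.Event()
        self._thread = None

    def run_once(self):
        """Run the script once, store the result, and return it."""
        try:
            proc = subprocess.run(
                self._argv, capture_output=True, text=True, timeout=self._timeout
            )
            # Assumption: exit status 0 means healthy, anything else unhealthy.
            healthy = proc.returncode == 0
            report = proc.stdout.strip()
        except subprocess.TimeoutExpired:
            healthy, report = False, "health check script timed out"
        except OSError as exc:
            healthy, report = False, f"could not run health check script: {exc}"
        result = {"healthy": healthy, "report": report, "checked_at": time.time()}
        with self._lock:
            self._latest = result
        return result

    def latest(self):
        """Return a copy of the stored result without running the script."""
        with self._lock:
            return dict(self._latest)

    def start(self):
        """Start the background thread that re-runs the check periodically."""
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop(self):
        """Ask the background thread to exit and wait for it."""
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

    def _loop(self):
        while not self._stop.is_set():
            self.run_once()
            # Sleep for the interval, but wake immediately if stop() is called.
            self._stop.wait(self._interval)
```

In this sketch, whatever builds the heartbeat (or, later, a ping() implementation) would only call latest() and attach the stored result, so a slow or hung script would not delay the heartbeat itself.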
> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>
> Hadoop must have some mechanism to find the health status of a node. It should run the health check script periodically and, if there are any errors, it should blacklist the node. This will be really helpful when we run static mapred clusters. Otherwise we may have to run scripts or daemons periodically to find the node status and take the node offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.