From: "Steve Loughran (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Date: Tue, 16 Jun 2009 03:59:07 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status
Message-ID: <2055474491.1245149947831.JavaMail.jira@brutus>
In-Reply-To: <138607381.1236871130868.JavaMail.jira@brutus>

[ 
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720061#action_12720061 ]

Steve Loughran commented on HADOOP-5478:
----------------------------------------

1. The timeouts in Shell would seem useful on their own; every shell operation ought to have timeouts for extra robustness.

2. This would fit fairly well under the HADOOP-3628 stuff, where the monitor would start and stop with the TT lifecycle; we'd have to think about how to integrate it with the ping operation. I think returning the most recent status would be good.

3. At some point in the future, it would be good for the policy of acting on TT failure to be moved out of the JT. In infrastructure where the response to failure is to terminate that (virtual) host and ask for a new one, you react very differently to failure. It's not something the JT needs to handle, other than to pass up bad news.

4. I'm not sure about all the kill -9 and shutdown hook stuff; it's getting into fragile waters. It is hard to test, hard to debug, and creates complex situations, especially in test runs or stuff hosted in different runtimes.
 * This helper script stuff must be optional; I would turn it off on my systems as I test health in different ways.
 * Kill handlers are best designed to do very little and be robust against odd system states, and not assume any other parts of the cluster are live.

For the curious, the way SmartFrog manages its health is that every component tracks the last time it was asked by its parent for its health; if the time since then ever exceeds a (programmed) limit, it terminates itself. Every process pings the root component; it's up to that to ping its children and act on failures, and to recognise and act on timeouts. This works OK for single-host work; in a cluster you don't want any SPOFs and tend to take an aggregate view: there has to be one Namenode, one JT, "enough" workers.
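
The self-terminating check described above can be sketched roughly as follows. This is an illustration in Python, not SmartFrog's actual code: the class name `LivenessWatchdog`, its methods, and the injected `clock` and `terminate` callables are all hypothetical, chosen only to show the idea of "remember the last parent ping, give up once the gap exceeds a limit".

```python
import time


class LivenessWatchdog:
    """Tracks when a component was last asked for its health by its parent.

    Hypothetical sketch; names and structure are assumptions, not SmartFrog's API.
    """

    def __init__(self, limit_seconds, clock=time.monotonic):
        self.limit_seconds = limit_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_ping = clock()    # treat construction as the first ping

    def ping(self):
        """Called whenever the parent asks this component for its health."""
        self._last_ping = self._clock()

    def parent_is_overdue(self):
        """True once the gap since the last parent ping exceeds the limit."""
        return self._clock() - self._last_ping > self.limit_seconds

    def check(self, terminate):
        """Run the self-check; call terminate() if the parent has gone quiet."""
        if self.parent_is_overdue():
            terminate()
            return True
        return False
```

A component would call `check()` from its own timer and pass whatever shutdown routine it uses as `terminate`. In keeping with point 4, that routine should do very little and should not assume the rest of the cluster is reachable.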
I have a component to check the health of a file in the filesystem; every time its health is checked, it looks for the file it was bound to and checks that it is present and within a specified size range. This is handy for checking that files you value are there, and that the FS is visible across the network (very important on virtual servers with odd networking). I don't have anything similar for checking that TTs are good; the best check would be test work.

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, hadoop-5478-3.patch, hadoop-5478-4.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It should run the health check script periodically and, if there are any errors, it should blacklist the node. This will be really helpful when we run static mapred clusters. Otherwise we may have to run some scripts/daemons periodically to find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.