Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <2055474491.1245149947831.JavaMail.jira@brutus>
Date: Tue, 16 Jun 2009 03:59:07 -0700 (PDT)
From: "Steve Loughran (JIRA)" <jira@apache.org>
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-5478) Provide a node health check script
 and run it periodically to check the node health status
In-Reply-To: <138607381.1236871130868.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720061#action_12720061 ] 

Steve Loughran commented on HADOOP-5478:
----------------------------------------

1. The timeouts in Shell would seem useful on their own; every shell operation ought to have timeouts for extra robustness.

2. This would fit fairly well under the HADOOP-3628 work, where the monitor would start and stop with the TT lifecycle. We'd have to think about how to integrate it with the ping operation; I think returning the most recent status would be good.


3. At some point in the future, it would be good for the policy of acting on TT failure to be moved out of the JT. In infrastructure where the response to failure is to terminate that (virtual) host and ask for a new one, you react very differently to failure. It's not something the JT needs to handle, other than to pass up the bad news.

4. I'm not sure about all the kill -9 and shutdown hook stuff; it's getting into fragile waters. It is hard to test, hard to debug, and creates complex situations, especially in test runs or in code hosted in different runtimes.

* This helper script stuff must be optional; I would turn it off on my systems, as I test health in different ways.
* Kill handlers are best designed to do very little and to be robust against odd system states, and they should not assume any other parts of the cluster are live.
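
As an illustration of point 1, here is a minimal sketch of a shell operation with a hard timeout. It is not the code in the attached patches and not Hadoop's Shell class: the TimedShell name, its Result type and the -1 exit code for a timed-out run are all assumptions made for this example. It relies only on the standard java.lang.Process API (waitFor with a timeout and destroyForcibly, Java 8+; Redirect.DISCARD needs Java 9+).

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not Hadoop's Shell): runs a command with a hard time limit. */
final class TimedShell {

    /** Outcome of one run; exitCode is -1 (an assumed convention) when the command timed out. */
    static final class Result {
        final boolean timedOut;
        final int exitCode;

        Result(boolean timedOut, int exitCode) {
            this.timedOut = timedOut;
            this.exitCode = exitCode;
        }
    }

    static Result run(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout and discard both, so a chatty script
        // cannot block on a full pipe while we wait for it.
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);

        Process process = builder.start();
        if (!process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            // Deadline passed: kill the child and wait for it to be reaped.
            process.destroyForcibly();
            process.waitFor();
            return new Result(true, -1);
        }
        return new Result(false, process.exitValue());
    }
}
```

A caller could treat a timed-out run the same as a failed one. Note that destroyForcibly only targets the direct child; a script that forks its own children would need extra handling, which this sketch does not attempt.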

For the curious, the way SmartFrog manages its health is that every component tracks the last time its parent asked for its health; if the time since then ever exceeds a (programmed) limit, it terminates itself. Every process pings the root component; it's up to that to ping its children and act on failures, and to recognise and act on timeouts. This works OK for single-host work; in a cluster you don't want any SPOFs and tend to take an aggregate view: there has to be one Namenode, one JT, and "enough" workers.

I have a component to check the health of a file in the filesystem; every time its health is checked, it looks for the file it was bound to and checks that it is present and within a specified size range. This is handy for checking that files you value are there, and that the FS is visible across the network (very important on virtual servers with odd networking). I don't have anything similar for checking that TTs are good; the best check would be test work.
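
The two checks described above can be sketched roughly as follows. This is not SmartFrog's actual API: the PingDeadline and FileHealthCheck classes and all of their methods are hypothetical names invented for this illustration, and timestamps are passed in explicitly so the logic is easy to test.

```java
import java.io.File;

/** Hypothetical sketch of the ping-deadline idea; not SmartFrog code. */
final class PingDeadline {
    private final long limitMillis;
    private volatile long lastPingMillis;

    PingDeadline(long limitMillis, long nowMillis) {
        this.limitMillis = limitMillis;
        this.lastPingMillis = nowMillis;
    }

    /** Called whenever the parent asks this component for its health. */
    void pinged(long nowMillis) {
        lastPingMillis = nowMillis;
    }

    /** True once the parent has been silent for longer than the limit;
     *  the component would then terminate itself. */
    boolean expired(long nowMillis) {
        return nowMillis - lastPingMillis > limitMillis;
    }
}

/** Hypothetical file check: the bound file must exist and have a size in [minBytes, maxBytes]. */
final class FileHealthCheck {
    private final File file;
    private final long minBytes;
    private final long maxBytes;

    FileHealthCheck(File file, long minBytes, long maxBytes) {
        this.file = file;
        this.minBytes = minBytes;
        this.maxBytes = maxBytes;
    }

    boolean healthy() {
        if (!file.isFile()) {
            return false; // missing, or not a regular file
        }
        long size = file.length();
        return size >= minBytes && size <= maxBytes;
    }
}
```

In a real component, expired would be polled with System.currentTimeMillis() from a timer thread, and healthy would be called each time the parent pings.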

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, hadoop-5478-3.patch, hadoop-5478-4.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It should run the health check script periodically and, if there are any errors, it should blacklist the node. This will be really helpful when we run static mapred clusters. Otherwise we may have to run some scripts/daemons periodically to find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.