hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status
Date Tue, 16 Jun 2009 10:59:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720061#action_12720061 ]

Steve Loughran commented on HADOOP-5478:
----------------------------------------

1. The timeouts in Shell would seem useful on their own; every shell operation ought to have
timeouts for extra robustness.
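
Something along these lines, say; the class and method names here are illustrative, not the
existing org.apache.hadoop.util.Shell API, and a real version would also need to drain the
process output:

import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Sketch only: run a command, killing it if it does not finish in time.
public class TimedShellExec {

  public static int execute(long timeoutMillis, String... command)
      throws IOException, InterruptedException {
    Process process = new ProcessBuilder(command)
        .redirectErrorStream(true)          // merge stderr into stdout
        .start();
    boolean finished = process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS);
    if (!finished) {
      process.destroyForcibly();            // a hung script gets killed, not waited on forever
      throw new IOException("Timed out after " + timeoutMillis + "ms: "
          + String.join(" ", command));
    }
    return process.exitValue();             // 0 means the check passed
  }
}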

2. This would fit fairly well under the HADOOP-3628 stuff, where the monitor would start and
stop with the TT lifecycle; we'd have to think about how to integrate it with the ping operation.
I think returning the most recent status would be good.
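
To show what I mean by returning the most recent status, here's a rough sketch (the class and
method names are mine, not anything in the patch): a background thread runs the check on a
schedule and caches the result, so ping() stays cheap and never blocks on the script.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: background health checking with a cached result for ping().
public class NodeHealthMonitor {

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final AtomicBoolean healthy = new AtomicBoolean(true);

  /** Starts with the service; runs the supplied check every intervalMillis. */
  public void start(final Runnable healthCheck, long intervalMillis) {
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() {
        try {
          healthCheck.run();                // e.g. invoke the admin's script
          healthy.set(true);
        } catch (RuntimeException e) {      // script failed or timed out
          healthy.set(false);
        }
      }
    }, 0, intervalMillis, TimeUnit.MILLISECONDS);
  }

  /** Called from ping(): cheap, returns the last known status. */
  public boolean isHealthy() {
    return healthy.get();
  }

  /** Stops with the service. */
  public void stop() {
    scheduler.shutdownNow();
  }
}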


3. At some point in the future, it would be good for the policy of acting on TT failure to
be moved out of the JT. In infrastructure where the response to failure is to terminate that
(virtual) host and ask for a new one, you react to failure very differently. It's not something
the JT needs to handle, other than passing up the bad news.
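
As a very rough illustration of what pulling that policy out could look like (the interface and
class names are hypothetical, not anything proposed in the patch):

// Hypothetical shape of the split: the JT only reports bad news,
// a pluggable policy decides what to do about it.
public interface NodeFailurePolicy {
  void onTrackerUnhealthy(String trackerName, String report);
}

// One deployment might blacklist the node; a virtualised one might
// instead release the host and ask the infrastructure for a new one.
class ReplaceHostPolicy implements NodeFailurePolicy {
  public void onTrackerUnhealthy(String trackerName, String report) {
    System.out.println("Replacing host of " + trackerName + ": " + report);
  }
}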

4. I'm not sure about all the kill -9 and shutdown hook stuff; it's getting into fragile waters.
It's hard to test, hard to debug, and it creates complex situations, especially in test runs or
in code hosted in different runtimes.

* this helper script stuff must be optional; I would turn it off on my systems, as I test health
in different ways.
* kill handlers are best designed to do very little, be robust against odd system states, and
not assume any other parts of the cluster are live (see the sketch after this list).
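
Here's the kind of "do very little" kill handler I mean; it is only a sketch with names of my
own choosing: the hook flips a flag and nothing more, so there is nothing in it that can hang,
throw, or talk to a cluster that may already be half gone.

// Sketch only: a kill handler that flips a flag and nothing else.
public class MinimalShutdownHook {

  private static volatile boolean shuttingDown = false;

  public static boolean isShuttingDown() {
    return shuttingDown;
  }

  public static void install() {
    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
      public void run() {
        shuttingDown = true;   // no RPCs, no cleanup that can hang or throw
      }
    }, "minimal-shutdown-hook"));
  }
}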

For the curious, the way SmartFrog manages its health is that every component tracks the last
time its parent asked it for its health; if that time ever exceeds a (programmed) limit, it
terminates itself. Every process pings the root component; it's up to that component to ping its
children, act on failures, and recognise and act on timeouts. This works OK for single-host work;
in a cluster you don't want any SPOFs and tend to take an aggregate view: there has to be one
Namenode, one JT, and "enough" workers. I have a component to check the health of a file in the
filesystem; every time its health is checked, it looks for the file it was bound to and checks
that it is present and within a specified size range. This is handy for checking that files you
value are there, and that the FS is visible across the network (very important on virtual servers
with odd networking). I don't have anything similar for checking that TTs are good; the best
check would be test work.
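
In Java terms the file-check component does roughly this (a paraphrase with illustrative names,
not the SmartFrog code itself): every health ping re-checks that the bound file exists and that
its size sits inside the configured range.

import java.io.File;
import java.io.IOException;

// Sketch only: fail the health check if a valued file is missing or
// its size falls outside the expected range.
public class FileHealthCheck {

  private final File file;
  private final long minSize;
  private final long maxSize;

  public FileHealthCheck(File file, long minSize, long maxSize) {
    this.file = file;
    this.minSize = minSize;
    this.maxSize = maxSize;
  }

  /** Called on every health ping; throws if the file looks wrong. */
  public void check() throws IOException {
    if (!file.exists()) {
      throw new IOException("Missing file: " + file);
    }
    long size = file.length();
    if (size < minSize || size > maxSize) {
      throw new IOException("File " + file + " is " + size
          + " bytes, outside the range [" + minSize + ", " + maxSize + "]");
    }
  }
}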

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, hadoop-5478-3.patch, hadoop-5478-4.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It should run the
health check script periodically, and if there are any errors, it should blacklist the node.
This will be really helpful when we run static mapred clusters. Otherwise we may have to run
some scripts/daemons periodically to find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

