hadoop-common-dev mailing list archives

From "Sreekanth Ramakrishnan (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status
Date Fri, 29 May 2009 03:48:45 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sreekanth Ramakrishnan updated HADOOP-5478:

    Attachment: hadoop-5478-1.patch

Attaching a first-cut patch to address the issue.

The patch does the following:

* The patch requires two configuration items on TaskTracker nodes: {{mapred.tasktracker.health_check_script}}
and {{mapred.tasktracker.health_check_interval}}. {{mapred.tasktracker.health_check_script}}
must be the absolute path to the script file. If the file does not exist when the TT starts up,
the monitor is turned off.
* The monitor periodically runs the shell script. It ignores the script's exit code,
captures the script's output, and searches that output for the pattern "ERROR".
* If ERROR is present in the output, the monitor marks the node as unhealthy and reports
the entire output to the JT as the node's status.
* The JT then decides, based on the reported health of the node, whether to blacklist or
whitelist the node.
* The attached test case checks blacklisting and whitelisting against the output of the script.
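
For readers who want a concrete picture of the monitor behaviour listed above, here is a minimal sketch in Java. It is not taken from hadoop-5478-1.patch: the class name NodeHealthSketch, the HealthReport type, and the method names are all hypothetical, and merging stderr into stdout is an assumption made only for illustration. It shows the three pieces described in the list: the monitor is disabled when the script file is missing, the exit code is ignored, and any occurrence of "ERROR" in the output marks the node as unhealthy.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; none of these names come from the actual patch.
class NodeHealthSketch {

    /** Outcome of one health check run: the verdict plus the raw script output. */
    static final class HealthReport {
        final boolean healthy;
        final String output;

        HealthReport(boolean healthy, String output) {
            this.healthy = healthy;
            this.output = output;
        }
    }

    /** The monitor only runs if the configured path is absolute and the file exists. */
    static boolean shouldRunMonitor(String scriptPath) {
        if (scriptPath == null) {
            return false;
        }
        File script = new File(scriptPath);
        return script.isAbsolute() && script.isFile();
    }

    /** Any occurrence of "ERROR" in the output marks the node as unhealthy. */
    static HealthReport classify(String output) {
        return new HealthReport(!output.contains("ERROR"), output);
    }

    /** Runs the script once and classifies its output; the exit code is ignored. */
    static HealthReport runOnce(String scriptPath) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(scriptPath);
        builder.redirectErrorStream(true); // assumption: treat stderr as part of the output
        Process process = builder.start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        process.waitFor(); // wait for completion, but do not inspect the exit code
        return classify(output.toString());
    }

    public static void main(String[] args) {
        // Classify two sample outputs without launching a real script.
        System.out.println(classify("ERROR: /tmp is not writable\n").healthy);
        System.out.println(classify("all checks passed\n").healthy);
    }
}
```

In a real TaskTracker, runOnce would presumably be scheduled at the interval given by {{mapred.tasktracker.health_check_interval}}, with the resulting report sent to the JT; that wiring is left out of the sketch.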

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch
> Hadoop must have some mechanism to find the health status of a node. It should run the
health check script periodically and, if there are any errors, it should blacklist the node.
This will be really helpful when we run static mapred clusters. Otherwise we may have to run some
scripts/daemons periodically to find the node status and take the node offline manually.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
