hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status
Date Wed, 17 Jun 2009 13:20:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720655#action_12720655 ]

Steve Loughran commented on HADOOP-5478:

# Timestamps would help; they would push more of the analysis up to whoever is querying the TT.
# If the main RPC is so overwhelmed that it can't answer a health query, that's a sign of a problem
anyway. Your TT is no longer _live_.

The health checking would fit in with the notion of a {{Ping}} operation, as raised in HADOOP-5622.
Every service should have a way of answering "are you up?". If it fails to answer the query: trouble.
If the call returns with an error: trouble. If the call returns saying it is well, then all
you know is that the service thinks it is well; it still may not be capable of useful work.
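
A minimal sketch of how a caller might tell those three outcomes apart. Everything here is hypothetical: {{PingCheck}}, {{Outcome}} and {{classify}} are illustration names, and the ping is modelled as a plain {{Callable}} rather than any real Hadoop RPC.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Hadoop: classifies the result of a ping call.
final class PingCheck {

    // The three cases from the comment above.
    enum Outcome { NO_ANSWER, ERROR, REPORTS_WELL }

    // Runs the ping on a worker thread and waits at most timeoutMillis for a reply.
    static Outcome classify(Callable<Boolean> ping, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> reply = executor.submit(ping);
            try {
                Boolean well = reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
                // A reply that does not say "well" is treated as an error here.
                return Boolean.TRUE.equals(well) ? Outcome.REPORTS_WELL : Outcome.ERROR;
            } catch (TimeoutException e) {
                // No reply in time: the service is not live as far as we can tell.
                reply.cancel(true);
                return Outcome.NO_ANSWER;
            } catch (ExecutionException e) {
                // The call came back, but with a failure.
                return Outcome.ERROR;
            } catch (InterruptedException e) {
                // We were interrupted while waiting; restore the flag and give up.
                Thread.currentThread().interrupt();
                return Outcome.NO_ANSWER;
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note that {{REPORTS_WELL}} only records what the service said about itself; as above, it is no proof that the service can do useful work.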

What this operation does do is set more requirements on what gets returned: you probably want
to return something machine-readable that can be extended by different services, depending
on their view of the world. A hashtable containing writable stuff, perhaps.
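
One possible shape for such a return value, as a rough sketch only. {{HealthReport}} and its key names are made up for illustration. The {{write}}/{{readFields}} pair mirrors the method shape of Hadoop's {{Writable}} interface, but the class deliberately does not implement it, so the example compiles without Hadoop on the classpath. The timestamp field is an assumption that follows the first point above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical report type: a timestamp plus an open-ended set of key/value pairs.
final class HealthReport {
    private long timestampMillis;
    private final Map<String, String> entries = new TreeMap<>();

    HealthReport() { }

    HealthReport(long timestampMillis) { this.timestampMillis = timestampMillis; }

    // Each service adds whatever keys make sense for its view of the world.
    void put(String key, String value) { entries.put(key, value); }

    String get(String key) { return entries.get(key); }

    long timestampMillis() { return timestampMillis; }

    // Same method shape as Writable.write(DataOutput).
    void write(DataOutput out) throws IOException {
        out.writeLong(timestampMillis);
        out.writeInt(entries.size());
        for (Map.Entry<String, String> e : entries.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    // Same method shape as Writable.readFields(DataInput).
    void readFields(DataInput in) throws IOException {
        timestampMillis = in.readLong();
        entries.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String key = in.readUTF();
            entries.put(key, in.readUTF());
        }
    }

    // Convenience wrappers so callers can round-trip through a byte array.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static HealthReport fromBytes(byte[] bytes) {
        HealthReport report = new HealthReport();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            report.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return report;
    }
}
```

A caller that does not recognise a key can simply ignore it, which is what lets different services extend the report without changing the operation's signature.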

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Sreekanth Ramakrishnan
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, hadoop-5478-3.patch, hadoop-5478-4.patch,
> Hadoop must have some mechanism to find the health status of a node. It should run the
health check script periodically and, if there are any errors, it should blacklist the node.
This will be really helpful when we run static mapred clusters. Otherwise we may have to run some
scripts/daemons periodically to find the node status and take it offline manually.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
