hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status
Date Wed, 17 Jun 2009 16:02:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720731#action_12720731

Steve Loughran commented on HADOOP-5478:

> Though, at both times, the only one who knows about the trouble is the health checker
> and not the rest of the world.

which is why your management tools
# need to be HA toys themselves
# need to be able to ask the apps for their health
# may need to be able to do test jobs to probe system health
# may need the ability to react to failure according to the infrastructure in which HDFS is
running, and your policy. 

If HDFS is running in anything that supports the EC2 APIs and a TT is playing up, I'd start
by rebooting that node; if it still doesn't come up, I'd decommission its datanode, terminate the
VM and ask for a new one. That's a very different policy from a physical cluster, where you
may want to blacklist the TT while its datanode service stays live.
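
As a rough illustration of how such an infrastructure-dependent policy could be written
down, here is a minimal Python sketch. The names (Infrastructure, plan_recovery) and the
action strings are hypothetical labels chosen for this example; they are not Hadoop or EC2
APIs, and a real management tool would map each action onto its own calls.

```python
from enum import Enum


class Infrastructure(Enum):
    """Hypothetical labels for where the cluster is running."""
    EC2_LIKE = "ec2-like"   # anything that supports EC2-style APIs
    PHYSICAL = "physical"   # fixed hardware that cannot simply be replaced


def plan_recovery(infra, reboot_attempted):
    """Return the ordered actions for a node whose TaskTracker is misbehaving.

    The action names are placeholders for whatever the management tool does.
    """
    if infra is Infrastructure.EC2_LIKE:
        if not reboot_attempted:
            # First response on virtual infrastructure: reboot the node.
            return ["reboot_node"]
        # The reboot did not help: retire the node and replace the VM.
        return ["decommission_node", "terminate_vm", "request_new_vm"]
    # Physical cluster: keep the node's storage role, stop scheduling tasks on it.
    return ["blacklist_tasktracker"]
```

The point of keeping the policy in one small function is that the same failure signal can
lead to different reactions depending on where the cluster runs.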

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Sreekanth Ramakrishnan
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, hadoop-5478-3.patch, hadoop-5478-4.patch,
> Hadoop must have some mechanism to find the health status of a node. It should run the
> health check script periodically and, if there are any errors, it should blacklist the node.
> This will be really helpful when we run static mapred clusters. Otherwise we may have to run some
> scripts/daemons periodically to find the node status and take it offline manually.
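
The mechanism requested in the issue could look roughly like the Python sketch below. It is
not the code from the attached patches. It assumes, purely for illustration, that the health
check is an external command whose non-zero exit status (or a timeout) means "unhealthy";
the function names, the 30 second timeout and the blacklist-as-a-set are all hypothetical.

```python
import subprocess
import sys
import time


def check_node(command, timeout_s=30.0):
    """Run the health check command once and return (healthy, detail)."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A hung check is treated as a failure (assumption of this sketch).
        return False, "health check timed out"
    except OSError as exc:
        # The script is missing or not executable.
        return False, f"health check could not be run: {exc}"
    if result.returncode != 0:
        detail = result.stdout.strip() or result.stderr.strip()
        return False, detail or f"exit code {result.returncode}"
    return True, ""


def update_blacklist(blacklist, node, healthy):
    """Blacklist an unhealthy node and clear a node that has recovered."""
    if healthy:
        blacklist.discard(node)
    else:
        blacklist.add(node)
    return blacklist


def monitor(command, node, blacklist, interval_s, rounds):
    """Run the check `rounds` times, sleeping `interval_s` seconds after each run."""
    for _ in range(rounds):
        healthy, _detail = check_node(command)
        update_blacklist(blacklist, node, healthy)
        time.sleep(interval_s)


if __name__ == "__main__":
    # Stand-in for a real script: a child Python process that exits with status 0.
    print(check_node([sys.executable, "-c", "pass"]))
```

Removing a node from the blacklist once the check passes again is a design choice of this
sketch; a cluster could equally require an operator to clear the entry by hand.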

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
