hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status
Date Thu, 18 Jun 2009 03:37:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721042#action_12721042 ]

Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------

Folks, I still maintain that the focus of this jira is just checking the health of the node, as
determined by an administrator-supplied script. The last few comments focus more on the
health of the TT. To make incremental progress, let us stick to the original scope and defer
discussion of checking the health of the TT, and of corrective actions for it, to a separate
jira.

So, to summarize, the health checker kills itself if it cannot communicate with the TT (similar
to the child JVM). If this happens because the TT is down, well and good: the 'lost tasktracker'
logic of the jobtracker would ensure this status is captured. If this happens because the
TT was overwhelmed, well, maybe the TT is not 'healthy' any more. But because we report the
timestamp of the last health status, administrators have an opportunity to notice that
something is amiss on this node when its health has not been updated for a while. Either way
we are alerted to problems, so the purpose is still served. Of course, there are better, more
automated ways to do it. Those would qualify for a next increment.
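
To illustrate the timestamp point, here is a minimal sketch of how a stale health report
could be detected. It is not taken from any of the attached patches: the class name
HealthStaleness, the method isStale, and the 10-minute threshold are all hypothetical,
chosen only for illustration.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for illustration only; not part of the HADOOP-5478 patches. */
public final class HealthStaleness {
    private HealthStaleness() {}

    /**
     * Returns true when the last reported health status is strictly older
     * than maxAge, i.e. the node has not updated its health for a while.
     */
    public static boolean isStale(Instant lastReported, Instant now, Duration maxAge) {
        return Duration.between(lastReported, now).compareTo(maxAge) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2009-06-18T03:37:07Z");
        Duration maxAge = Duration.ofMinutes(10); // assumed threshold

        // Reported 2 minutes ago: still fresh.
        System.out.println(isStale(now.minus(Duration.ofMinutes(2)), now, maxAge));  // false
        // Reported 30 minutes ago: something may be amiss on this node.
        System.out.println(isStale(now.minus(Duration.ofMinutes(30)), now, maxAge)); // true
    }
}
```

A check like this only flags the node for an administrator to look at. It does not say
whether the TT is down or merely overwhelmed, which matches the limited scope argued for above.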

Hope this makes sense.

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Sreekanth Ramakrishnan
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, hadoop-5478-3.patch, hadoop-5478-4.patch, hadoop-5478-5.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It should run the
> health check script periodically and, if there are any errors, it should blacklist the node.
> This will be really helpful when we run static mapred clusters. Otherwise we may have to run
> some scripts/daemons periodically to find the node status and take the node offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

