hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status
Date Thu, 18 Jun 2009 16:16:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721327#action_12721327 ]

Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------

bq. What interface do admins have that makes this obvious? If a cluster has 2500 TTs, it isn't
going to be obvious in a web UI that any given TT is sick.

Allen, just to be clear, nodes detected as unhealthy by the health check script are blacklisted,
as was discussed and as you noted in one of your earliest comments on this JIRA. Therefore
'mapred job -list-blacklisted-trackers' will readily point these out.

We are only discussing a specific case where the health checker VM exits on a node, say because
of some transient problem on the TT, *but* the TT is still up, and possibly unhealthy. I was
talking about the timestamps only as a way of detecting these TTs. I agree with you that it will
be difficult to look this information up on a web UI. But having this information available
centrally will help us build new tools or enhance existing ones.
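
To illustrate the kind of tool this could enable, here is a minimal sketch in Python. It is not
part of the patch. It assumes that for each TT we can obtain the time of its last heartbeat and
the time of its last health report; the TrackerReport fields, the function name and the timeout
values are all hypothetical. It flags TTs that are still heartbeating but whose health report
has gone stale.

```python
from dataclasses import dataclass


@dataclass
class TrackerReport:
    # All fields are hypothetical; they stand in for whatever is reported centrally.
    name: str
    last_heartbeat: float      # seconds since epoch of the last TT heartbeat
    last_health_report: float  # seconds since epoch of the last health check result


def stale_health_checkers(reports, now, heartbeat_timeout=600.0, health_timeout=1800.0):
    """Return names of trackers that are up but whose health report is stale.

    The timeout defaults are illustrative assumptions, not Hadoop settings.
    """
    stale = []
    for report in reports:
        tracker_alive = (now - report.last_heartbeat) <= heartbeat_timeout
        health_stale = (now - report.last_health_report) > health_timeout
        if tracker_alive and health_stale:
            stale.append(report.name)
    return sorted(stale)
```

A tool like this could be run against all 2500 TTs and would print only the handful that need
attention, which avoids having to scan a web UI.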

I believe that even with the current scope this feature is worth a try on live clusters. Given
that we have no way of detecting unhealthy nodes now, this is an improvement. If we do find
many instances where the health checker goes down while the TTs are up, we can easily provide
additional tools based on the information currently being reported. Hope that makes sense.

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Sreekanth Ramakrishnan
>         Attachments: active.png, blacklist1.png, cluster_setup.pdf, hadoop-5478-1.patch,
hadoop-5478-2.patch, hadoop-5478-3.patch, hadoop-5478-4.patch, hadoop-5478-5.patch, hadoop-5478-6.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It should run the
health check script periodically and, if there are any errors, it should blacklist the node.
This will be really helpful when we run static mapred clusters. Otherwise we may have to run some
scripts/daemons periodically to find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

