hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status
Date Tue, 16 Jun 2009 09:00:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719991#action_12719991 ]

Hemanth Yamijala commented on HADOOP-5478:

To add to Sreekanth's comments:

- We are using a new port number for the TaskTracker (TT) to bind to, on which the health
checker script sends updates. The other option was to reuse the port used for the
TaskUmbilicalProtocol. We thought the health service should not be mixed with child tasks
reporting status, and hence kept it separate.

- The other important point is how the health checker stops. Currently, the model is
similar to how a child task stops: if it can't report status to the TT, it kills itself.
This is required anyway, because it has to handle the case of the TT dying unexpectedly.
However, this is the extreme case. When the TT is stopped normally, there are better ways
to stop the health check script. For example, we could add a shutdown hook to the TT that
sends a signal to the health checker. We could also make the health checker a separate
daemon so that stop-mapred could stop it. Either option can easily be implemented as a
follow-up once the basic structure is in place.
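
To make the two stop paths above concrete, here is a rough sketch. It is not taken from the attached patches: the class HealthCheckerSketch, the StatusReporter interface, and the maxFailures threshold are all hypothetical names and assumptions made for illustration, and the normal-shutdown path is modelled as an in-process flag rather than a signal sent to a separate process.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch only; none of these names come from the HADOOP-5478 patches. */
class HealthCheckerSketch {

    /** Stand-in for whatever call the checker uses to send an update to the TT. */
    interface StatusReporter {
        boolean report(String status); // false means the TT could not be reached
    }

    private final StatusReporter reporter;
    private final int maxFailures; // assumed threshold; the comment does not specify one
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    private int consecutiveFailures = 0;

    HealthCheckerSketch(StatusReporter reporter, int maxFailures) {
        this.reporter = reporter;
        this.maxFailures = maxFailures;
    }

    /** Normal-shutdown path, e.g. triggered from a TT shutdown hook. */
    void requestStop() {
        stopRequested.set(true);
    }

    /** Runs one reporting cycle; returns false when the checker should exit. */
    boolean reportOnce(String status) {
        if (stopRequested.get()) {
            return false; // the TT asked us to stop
        }
        if (reporter.report(status)) {
            consecutiveFailures = 0; // the TT is alive, keep going
            return true;
        }
        // Extreme case: the TT is unreachable, so give up after repeated failures.
        consecutiveFailures++;
        return consecutiveFailures < maxFailures;
    }
}
```

If the checker ran inside the TT's JVM, the normal-shutdown path could be wired up with Runtime.getRuntime().addShutdownHook(new Thread(checker::requestStop)). With the checker as a separate process, as described above, the hook would instead have to signal that process.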

Please let us know if these points make sense.

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, hadoop-5478-3.patch
> Hadoop must have some mechanism to find the health status of a node. It should run the
> health check script periodically and, if there are any errors, it should blacklist the node.
> This will be really helpful when we run static mapred clusters. Otherwise, we may have to run
> some scripts/daemons periodically to find the node status and take it offline manually.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
