hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status
Date Fri, 12 Jun 2009 07:30:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718749#action_12718749 ]

Hemanth Yamijala commented on HADOOP-5478:

To summarize some of the comments above, the issue we are discussing is whether the node health
checker script should be run from a process separate from the tasktracker (TT), rather than
from a thread inside the TT as the current patch does. There are some motivations for doing so:

- Periodically launching a process from a Java service like the TT has caused problems in the
past; see HADOOP-5059, for example.
- Owen also mentioned instances where they'd seen the service itself lock up because the process
launch (and the underlying fork()/exec()) failed.

So, the proposal is to solve this problem by running the node health checker as a separate
process. This process can be configured with the following:
- Path to a script
- An interval
- TT's address for communication.

The process would periodically run the script (as done in the patch today) and report the
status to the TT using RPC. To keep management simple, we can, in the first cut, launch this
process from the TT itself and stop it when the TT is going down. In the future, it should be
possible to decouple this even more and have the two run independently. The simplicity we buy
in the first iteration is that administrators do not have to worry about managing the checker
separately for the time being, until we gain some experience with how the health check
script behaves in practice.
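
As a rough illustration of the separate process described above, here is a minimal Java sketch.
It is not the code from the attached patches. The class name NodeHealthChecker, the
StatusReporter interface (a stand-in for the RPC to the TT, whose protocol is not defined in
this thread), the timeout parameter, and the convention that exit code 0 means healthy are all
assumptions made for the example.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the standalone checker; not the code from the patches.
class NodeHealthChecker {

    // Stand-in for the RPC to the TT (assumption: the real protocol is not defined here).
    interface StatusReporter {
        void report(boolean healthy, String detail);
    }

    private final List<String> command;   // script path plus any arguments
    private final long intervalSeconds;   // delay between runs
    private final long timeoutSeconds;    // assumed safeguard against a hung script
    private final StatusReporter reporter;

    NodeHealthChecker(List<String> command, long intervalSeconds,
                      long timeoutSeconds, StatusReporter reporter) {
        this.command = command;
        this.intervalSeconds = intervalSeconds;
        this.timeoutSeconds = timeoutSeconds;
        this.reporter = reporter;
    }

    // Runs the script once and reports the result.
    // Assumption: exit code 0 means healthy; a non-zero exit, a timeout,
    // or a failed launch all count as unhealthy.
    boolean checkOnce() {
        boolean healthy = false;
        String detail;
        try {
            // A failed launch surfaces here as an IOException in this process
            // only; the TT itself is not involved in the fork()/exec().
            Process p = new ProcessBuilder(command).inheritIO().start();
            if (p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                healthy = p.exitValue() == 0;
                detail = "exit code " + p.exitValue();
            } else {
                p.destroyForcibly();
                detail = "script timed out";
            }
        } catch (IOException e) {
            detail = "could not launch script: " + e.getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            detail = "interrupted while waiting for script";
        }
        reporter.report(healthy, detail);
        return healthy;
    }

    // Schedules checkOnce() with a fixed delay between the end of one run
    // and the start of the next. The caller shuts the executor down to stop.
    ScheduledExecutorService start() {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleWithFixedDelay(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // An uncaught exception would silently cancel all future runs.
                System.err.println("health check failed: " + e);
            }
        }, 0, intervalSeconds, TimeUnit.SECONDS);
        return ses;
    }

    // Usage: java NodeHealthChecker <script path> <interval seconds>
    // The TT address is omitted because this sketch prints instead of using RPC.
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: NodeHealthChecker <script> <intervalSeconds>");
            return;
        }
        StatusReporter printer = (healthy, detail) ->
                System.out.println((healthy ? "HEALTHY" : "UNHEALTHY") + " (" + detail + ")");
        new NodeHealthChecker(Arrays.asList(args[0]), Long.parseLong(args[1]), 30, printer).start();
    }
}
```

In the first cut, the TT would start this process itself and stop it on shutdown. Swapping the
printing reporter for a real RPC client would be the only change needed to talk to the TT.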

Does this sound fine?

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch
> Hadoop must have some mechanism to find the health status of a node. It should run the
health check script periodically and, if there are any errors, it should blacklist the node.
This will be really helpful when we run static mapred clusters. Otherwise we may have to run
scripts/daemons periodically to find the node status and take the node offline manually.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
