hadoop-common-dev mailing list archives

From "Rick Cox (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3585) Hardware Failure Monitoring in large clusters running Hadoop/HDFS
Date Wed, 09 Jul 2008 02:02:33 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611877#action_12611877 ]

Rick Cox commented on HADOOP-3585:

This effort seems independent of providing a distributed file system (as evidenced by the
availability of a standalone version). Could the implementation be decoupled from the DataNode/NameNode
daemons? Many users will find this sort of hardware failure detection useful for their entire
set of hosts (including nodes that are not otherwise running any Hadoop daemons). Conversely,
many Hadoop users will already be running software with similar functionality, and will not
need or want Hadoop to provide it bundled with the DataNodes.

In that light, it seems like this would make more sense as a piece that can evolve independently
of the Hadoop core releases, either as a sub-project or incubator project (I don't know what
the Apache rules regarding those are) or as a contrib module (though that has the disadvantage
of coupling the release cycles).

> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>                 Key: HADOOP-3585
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3585
>             Project: Hadoop Core
>          Issue Type: New Feature
>         Environment: Linux
>            Reporter: Ioannis Koltsidas
>            Priority: Minor
>         Attachments: FailMon-standalone.zip, failmon.pdf, FailMon_Package_descrip.html,
>   Original Estimate: 480h
>  Remaining Estimate: 480h
> At IBM we're interested in identifying hardware failures on large clusters running Hadoop/HDFS.
> We are working on a framework that will enable nodes to identify failures on their hardware
> using the Hadoop log, the system log and various OS hardware diagnostic utilities. The implementation
> details are not very clear, but you can see a draft of our design in the attached document.
> We are pretty interested in Hadoop and system logs from failed machines, so if you are in
> possession of such, you are very welcome to contribute them; they would be of great value
> for hardware failure diagnosis.
> Some details about our design can be found in the attached document failmon.doc. More
> details will follow in a later post.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
