hadoop-common-dev mailing list archives

From "Ioannis Koltsidas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3585) Hardware Failure Monitoring in large clusters running Hadoop/HDFS
Date Fri, 04 Jul 2008 20:11:49 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12610651#action_12610651 ]

Ioannis Koltsidas commented on HADOOP-3585:

We have uploaded an initial version of our tool. Using the patch for the trunk code, one can
run FailMon on every DataNode and NameNode. All data gathered are uploaded into HDFS.

Also provided is an OfflineAnonymizer that anonymizes system and hadoop log files, so that
they can be easily distributed.

Details can be found in the attached FailMon_Package_Descrip.html.

Our greatest concern now is to be able to identify real hardware failures from the gathered
data. To that end, we need to gather as much data from real clusters as possible, so that
we can see how all kinds of errors and failures are actually logged by the system and Hadoop.
By correlating them, we will be able to systematically identify actual failures.

So, you are very welcome to use our patch and share the collected data and/or anonymize and
share any log files you may already have ;)

> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>                 Key: HADOOP-3585
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3585
>             Project: Hadoop Core
>          Issue Type: New Feature
>         Environment: Linux
>            Reporter: Ioannis Koltsidas
>            Priority: Minor
>         Attachments: FailMon-standalone.zip, failmon.pdf, FailMon_Package_descrip.html,
>   Original Estimate: 480h
>  Remaining Estimate: 480h
> At IBM we're interested in identifying hardware failures on large clusters running Hadoop/HDFS.
> We are working on a framework that will enable nodes to identify failures on their hardware
> using the Hadoop log, the system log and various OS hardware diagnosing utilities. The implementation
> details are not very clear, but you can see a draft of our design in the attached document.
> We are pretty interested in Hadoop and system logs from failed machines, so if you are in
> possession of such, you are very welcome to contribute them; they would be of great value
> for hardware failure diagnosing.
> Some details about our design can be found in the attached document failmon.doc. More
> details will follow in a later post.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
