hadoop-common-dev mailing list archives

From "Runping Qi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3585) Hardware Failure Monitoring in large clusters running Hadoop/HDFS
Date Wed, 09 Jul 2008 01:22:31 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611866#action_12611866 ]

Runping Qi commented on HADOOP-3585:

This looks complementary to the Chukwa project.
Chukwa is an HDFS-based storage system for collecting and mining log data.
It will provide simple APIs for applications to push log data (as well as metrics data or any
other kind of semi-structured data) to the storage.
Once the data reaches the storage, one can run map/reduce jobs or Pig jobs to mine it.
Currently, we are planning to implement a local agent that will collect the log files of Hadoop
service processes (data nodes, name nodes, task trackers, etc.) and push the data to Chukwa
storage. This agent will run on the machine as a separate process, outside the Hadoop processes.
The agent may also be used to collect system metrics and metrics from other applications.
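
As a rough illustration of the kind of local agent described above, here is a minimal Python sketch. It is not Chukwa code: the class name LogTailAgent, the poll_once method, and the sink callback (a stand-in for "push to Chukwa storage") are all hypothetical. It only shows the idea of remembering a per-file offset and forwarding complete new lines.

```python
import os
from typing import Callable, Dict, List


class LogTailAgent:
    """Hypothetical local agent: forwards newly appended log lines to a sink."""

    def __init__(self, paths: List[str], sink: Callable[[str, str], None]):
        self.paths = paths
        self.sink = sink  # assumed stand-in for a push to the storage system
        self.offsets: Dict[str, int] = {p: 0 for p in paths}

    def poll_once(self) -> int:
        """Send lines appended since the last poll; return how many were sent."""
        sent = 0
        for path in self.paths:
            if not os.path.exists(path):
                continue  # the service may not have created its log yet
            with open(path, "rb") as f:
                f.seek(self.offsets[path])
                while True:
                    raw = f.readline()
                    if not raw.endswith(b"\n"):
                        break  # end of file or a partially written line
                    self.sink(path, raw[:-1].decode("utf-8", errors="replace"))
                    sent += 1
                    self.offsets[path] = f.tell()
        return sent
```

A partially written last line is left in place and picked up on a later poll, once the writer has finished it. Log rotation and persisting offsets across restarts are deliberately left out of this sketch.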

There seem to be two possible ways for the failmon proposed in this Jira to work with Chukwa.
One is for it to push data to Chukwa directly using the Chukwa APIs.
The other is for it to produce log files and let the Chukwa agent push the data to Chukwa.
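
The two options can be contrasted with a small Python sketch. Everything here is an assumption made for illustration: the function names, the push callback (standing in for whatever the Chukwa API turns out to be), and the one-JSON-object-per-line log format are hypothetical and not taken from failmon or Chukwa.

```python
import json
from typing import Callable, Dict

Record = Dict[str, str]


def report_direct(record: Record, push: Callable[[Record], None]) -> None:
    """Option 1: hand the record straight to a storage client (assumed API)."""
    push(record)


def report_via_log(record: Record, log_path: str) -> None:
    """Option 2: append one JSON line to a local log for an agent to collect."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
```

In this sketch the second option keeps the monitor independent of the storage system, since it only writes a local file, while the first would presumably deliver data with less delay but ties the monitor to the client API.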

> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>                 Key: HADOOP-3585
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3585
>             Project: Hadoop Core
>          Issue Type: New Feature
>         Environment: Linux
>            Reporter: Ioannis Koltsidas
>            Priority: Minor
>         Attachments: FailMon-standalone.zip, failmon.pdf, FailMon_Package_descrip.html,
>   Original Estimate: 480h
>  Remaining Estimate: 480h
> At IBM we're interested in identifying hardware failures on large clusters running Hadoop/HDFS.
> We are working on a framework that will enable nodes to identify failures on their hardware
> using the Hadoop log, the system log and various OS hardware diagnosing utilities. The implementation
> details are not very clear, but you can see a draft of our design in the attached document.
> We are pretty interested in Hadoop and system logs from failed machines, so if you are in
> possession of such, you are very welcome to contribute them; they would be of great value
> for hardware failure diagnosing.
> Some details about our design can be found in the attached document failmon.doc. More
> details will follow in a later post.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
