hadoop-common-dev mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3585) Hardware Failure Monitoring in large clusters running Hadoop/HDFS
Date Wed, 23 Jul 2008 05:22:31 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615886#action_12615886 ]

dhruba borthakur commented on HADOOP-3585:
------------------------------------------

It would be nice if we could do the following:

1. Remove the code changes from NameNode.java and DataNode.java. Instead, run this app from
a bunch of shell scripts.

2. Move failmon.properties from conf to src/contrib/failmon/conf/ or something like that.

3. Make the code reside in src/contrib/failmon. Let it be a contrib project.

4. Write a junit test to test some amount of functionality. It could be based on standalone
class testing.

5. Integrate with the overall build process so that "ant compile-contrib" builds FailMon too.
Similarly, "ant test" should run the FailMon junit test(s).

6. Maybe some people from the chukwa project should browse this code and give a +1.

Once these are done, we should check this in as a contrib project.



> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>
>                 Key: HADOOP-3585
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3585
>             Project: Hadoop Core
>          Issue Type: New Feature
>         Environment: Linux
>            Reporter: Ioannis Koltsidas
>            Priority: Minor
>         Attachments: FailMon-standalone.zip, failmon.pdf, failmon.pdf, FailMon_Package_descrip.html,
HADOOP-3585.patch
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> At IBM we're interested in identifying hardware failures on large clusters running Hadoop/HDFS.
We are working on a framework that will enable nodes to identify failures on their hardware
using the Hadoop log, the system log and various OS hardware diagnosing utilities. The implementation
details are not very clear, but you can see a draft of our design in the attached document.
We are pretty interested in Hadoop and system logs from failed machines, so if you are in
possession of such, you are very welcome to contribute them; they would be of great value
for hardware failure diagnosing.
> Some details about our design can be found in the attached document failmon.doc. More
details will follow in a later post.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

