Message-ID: <57528323.1218347984387.JavaMail.jira@brutus>
Date: Sat, 9 Aug 2008 22:59:44 -0700 (PDT)
From: "dhruba borthakur (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-3585) Hardware Failure Monitoring in large clusters running Hadoop/HDFS
In-Reply-To: <815467284.1213736445958.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621212#action_12621212 ]

dhruba borthakur commented on HADOOP-3585:
------------------------------------------

I get a compilation error:

init-contrib:

compile:
     [echo] contrib: failmon

jar:

BUILD FAILED
/export/home/dhruba/commit/build.xml:900: The following error occurred while executing this line:
/export/home/dhruba/commit/src/contrib/build.xml:39: The following error occurred while executing this line:
/export/home/dhruba/commit/src/contrib/failmon/build.xml:28: The following error occurred while executing this line:
/export/home/dhruba/commit/build.xml:251: java.lang.ExceptionInInitializerError

Total time: 44 seconds

The unit tests have failed for some other reason (I think):
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3035/testReport/

> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>
>                 Key: HADOOP-3585
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3585
>             Project: Hadoop Core
>          Issue Type: New Feature
>         Environment: Linux
>            Reporter: Ioannis Koltsidas
>            Priority: Minor
>         Attachments: FailMon-standalone.zip, failmon.pdf, failmon.pdf, failmon2.pdf, FailMon_Package_descrip.html, FailMon_QuickStart.html, HADOOP-3585.2.patch, HADOOP-3585.patch, HADOOP-3585.patch
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> At IBM we're interested in identifying hardware failures on large clusters running Hadoop/HDFS. We are working on a framework that will enable nodes to identify failures on their hardware using the Hadoop log, the system log and various OS hardware diagnosing utilities. The implementation details are not very clear, but you can see a draft of our design in the attached document. We are pretty interested in Hadoop and system logs from failed machines, so if you are in possession of such, you are very welcome to contribute them; they would be of great value for hardware failure diagnosing.
> Some details about our design can be found in the attached document failmon.doc. More details will follow in a later post.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.