Date: Tue, 19 Apr 2011 23:12:05 +0000 (UTC)
From: "Jagane Sundar (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Message-ID: <1745917869.68504.1303254725867.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1394813331.25330.1301597106170.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-2413) TaskTracker should handle disk failures at both startup and runtime

[ https://issues.apache.org/jira/browse/MAPREDUCE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021851#comment-13021851 ]

Jagane Sundar commented on MAPREDUCE-2413:
------------------------------------------

>> Why do you call localStorage.isDiskFailed and then ignore the results?

Here is some more context on why we ignore the return value of the call to isDiskFailed(): LocalStorage.isDiskFailed() returns true if a disk has failed since the last time this method was called. When we call it from initialize(), we do so only to reset that state.

Also, Owen, I would like to add to Ravi's reply regarding the following comment of yours:

>> Rather than setting the "conf" attribute for the http server, you should set an attribute with the localStorage object. All uses of MAPRED_LOCALDIR_PROPERTY should be removed, other than the original creation of the localStorage. Furthermore, the property should never be set.

This would require a lot of changes to existing code, and I am not certain that they are worth the effort. I acknowledge that the software would be more elegant if written the way you propose, but my concern is that we would end up changing a lot of code that is already inelegant in its use of MAPRED_LOCALDIR_PROPERTY. Our desire is to keep the changes limited in scope, so I am requesting that you accept the patch as Ravi last submitted it, without this change.

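
To make the "failed since the last call" behaviour of isDiskFailed() concrete, here is a minimal sketch of a flag that is cleared each time it is read. It is not the LocalStorage class from the patch: the class name DiskFailureLatch and the method recordFailure() are hypothetical, and only the isDiskFailed() name and its described semantics come from the comment above.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical stand-in for the disk-failure state discussed above.
 * Not the LocalStorage class from the MAPREDUCE-2413 patch.
 */
class DiskFailureLatch {
    // True if a failure has been recorded since the flag was last read.
    private final AtomicBoolean failedSinceLastCheck = new AtomicBoolean(false);

    /** Hypothetical hook: called when a disk check detects a failure. */
    public void recordFailure() {
        failedSinceLastCheck.set(true);
    }

    /**
     * Returns true if a failure was recorded since the previous call,
     * and clears the flag as a side effect.
     */
    public boolean isDiskFailed() {
        return failedSinceLastCheck.getAndSet(false);
    }

    public static void main(String[] args) {
        DiskFailureLatch latch = new DiskFailureLatch();
        latch.recordFailure();

        // As in initialize(): the result is ignored on purpose,
        // the call is made only to reset the state.
        latch.isDiskFailed();

        System.out.println(latch.isDiskFailed()); // prints false
    }
}
```

Under these assumptions, ignoring the return value is intentional: the first call after startup only discards any failure recorded before initialization, so later calls report new failures alone.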
> TaskTracker should handle disk failures at both startup and runtime
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2413
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2413
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task-controller, tasktracker
>   Affects Versions: 0.20.204.0
>            Reporter: Bharath Mundlapudi
>            Assignee: Ravi Gummadi
>             Fix For: 0.20.204.0
>
>         Attachments: MR-2413.v0.1.patch, MR-2413.v0.patch
>
>
> At present, TaskTracker doesn't handle disk failures properly both at startup and runtime.
> (1) Currently TaskTracker doesn't come up if any of the mapred-local-dirs is on a bad disk. TaskTracker should ignore that particular mapred-local-dir and start up and use only the remaining good mapred-local-dirs.
> (2) If a disk goes bad while TaskTracker is running, currently TaskTracker doesn't do anything special. This results in either
> (a) TaskTracker continues to "try to use that bad disk" and this results in lots of task failures and possibly job failures (because of multiple TTs having bad disks) and eventually these TTs getting graylisted for all jobs. And this needs manual restart of TT with modified configuration of mapred-local-dirs avoiding the bad disk. OR
> (b) Health check script identifying the disk as bad and the TT gets blacklisted. And this also needs manual restart of TT with modified configuration of mapred-local-dirs avoiding the bad disk.
> This JIRA is to make TaskTracker more fault-tolerant to disk failures solving (1) and (2). i.e. TT should start even if at least one of the mapred-local-dirs is on a good disk and TT should adjust its in-memory list of mapred-local-dirs and avoid using bad mapred-local-dirs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
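
The behaviour requested in (1) and (2) of the issue description can be sketched roughly as follows. This is an illustration only, not code from the attached patches: the class GoodLocalDirs, the methods isUsable() and filterGoodDirs(), and the choice of checks (directory exists or can be created, and is readable, writable and executable) are all assumptions made for the example.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of the MAPREDUCE-2413 patches. */
class GoodLocalDirs {

    /**
     * Assumed health check: the directory exists (or can be created)
     * and is readable, writable and executable.
     */
    static boolean isUsable(File dir) {
        return (dir.isDirectory() || dir.mkdirs())
                && dir.canRead() && dir.canWrite() && dir.canExecute();
    }

    /**
     * Returns the configured directories that pass the check.
     * Fails only when no directory at all is usable.
     */
    static List<String> filterGoodDirs(List<String> configured) throws IOException {
        List<String> good = new ArrayList<>();
        for (String path : configured) {
            if (isUsable(new File(path))) {
                good.add(path);
            }
            // Bad directories are skipped rather than aborting startup.
        }
        if (good.isEmpty()) {
            throw new IOException("No usable local directories among " + configured);
        }
        return good;
    }
}
```

In this sketch the same filter could be run once at startup and again periodically at runtime, with the result replacing the in-memory list of directories, so that a disk that goes bad later is dropped without a manual restart.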