From: "Eli Collins (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Fri, 22 Jul 2011 22:29:10 +0000 (UTC)
Subject: [jira] [Commented] (MAPREDUCE-2413) TaskTracker should handle disk failures at both startup and runtime
Message-ID: <359794346.202.1311373750370.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1394813331.25330.1301597106170.JavaMail.tomcat@hel.zones.apache.org>

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069815#comment-13069815 ]

Eli Collins commented on MAPREDUCE-2413:
----------------------------------------

Heads up, per this thread on mr-dev [1] this may be a wasted effort.

[1] http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201107.mbox/%3CCAPn_vTsdiiqfCB2G0HfsOr3W_4PKoocPcTf2VB93Y3MZrzRczQ@mail.gmail.com%3E

> TaskTracker should handle disk failures at both startup and runtime
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2413
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2413
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task-controller, tasktracker
>    Affects Versions: 0.20.204.0
>            Reporter: Bharath Mundlapudi
>            Assignee: Ravi Gummadi
>             Fix For: 0.20.204.0
>
>         Attachments: MR-2413.v0.1.patch, MR-2413.v0.2.patch, MR-2413.v0.3.patch, MR-2413.v0.patch
>
>
> At present, TaskTracker doesn't handle disk failures properly, either at startup or at runtime.
> (1) Currently TaskTracker doesn't come up if any of the mapred-local-dirs is on a bad disk. TaskTracker should ignore that particular mapred-local-dir, start up, and use only the remaining good mapred-local-dirs.
> (2) If a disk goes bad while TaskTracker is running, TaskTracker currently doesn't do anything special. This results in either
> (a) TaskTracker continuing to try to use that bad disk, which results in lots of task failures and possibly job failures (because of multiple TTs having bad disks), and eventually in these TTs getting graylisted for all jobs. This needs a manual restart of the TT with a modified mapred-local-dirs configuration that avoids the bad disk. OR
> (b) the health check script identifying the disk as bad and the TT getting blacklisted. This also needs a manual restart of the TT with a modified mapred-local-dirs configuration that avoids the bad disk.
> This JIRA is to make TaskTracker more fault-tolerant to disk failures by solving (1) and (2). That is, the TT should start as long as at least one of the mapred-local-dirs is on a good disk, and it should adjust its in-memory list of mapred-local-dirs to avoid using the bad ones.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
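
For readers of the archive, the behaviour requested in (1) and (2) can be sketched roughly as below. This is an illustration only and not the attached MR-2413 patch: the class LocalDirTracker and its methods isUsable, recheck and getGoodDirs are hypothetical names, and the health test (an existing directory that is readable, writable and executable) is an assumption about what counts as a "good" mapred-local-dir.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the requested behaviour; not Hadoop's actual code. */
class LocalDirTracker {
    // In-memory list of local dirs currently believed to be on good disks.
    private final List<File> goodDirs = new ArrayList<>();

    /** Startup, case (1): skip bad dirs and fail only if none is usable. */
    LocalDirTracker(List<String> configuredDirs) {
        for (String path : configuredDirs) {
            File dir = new File(path);
            if (isUsable(dir)) {
                goodDirs.add(dir);
            }
        }
        if (goodDirs.isEmpty()) {
            throw new IllegalStateException("no usable mapred-local-dir");
        }
    }

    /** Assumed health test: an existing, readable, writable, executable directory. */
    static boolean isUsable(File dir) {
        return dir.isDirectory() && dir.canRead() && dir.canWrite() && dir.canExecute();
    }

    /** Runtime, case (2): drop dirs that have gone bad; returns how many were removed. */
    synchronized int recheck() {
        int before = goodDirs.size();
        goodDirs.removeIf(dir -> !isUsable(dir));
        return before - goodDirs.size();
    }

    /** Read-only snapshot of the dirs that are still considered good. */
    synchronized List<File> getGoodDirs() {
        return Collections.unmodifiableList(new ArrayList<>(goodDirs));
    }
}
```

In this sketch a periodic caller would invoke recheck() and allocate task output only from getGoodDirs(). What the TT should do when the list becomes empty at runtime is not specified in the description above and is left open here.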