Message-ID: <595409617.1257907287972.JavaMail.jira@brutus>
Date: Wed, 11 Nov 2009 02:41:27 +0000 (UTC)
From: "Aaron Kimball (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Subject: [jira] Updated: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing
In-Reply-To: <44290688.1255733731270.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball
updated MAPREDUCE-1119:
-------------------------------------

    Attachment: MAPREDUCE-1119.3.patch

Attaching a new patch to address code review issues.

* Renamed {{QUIT_TASK_JVM}} to {{SIGQUIT_TASK_JVM}}.
* A task timeout causes a {{SIGQUIT}}; other task kill events do not.
** Modified the calls in the task-kill call chain to pass along a {{wasFailure}} bit.
** Modified all associated call sites to forward the existing {{wasFailure}} bit, or to generate a new {{true}} or {{false}} as appropriate.
** Modified TestJobKillAndFail to distinguish between job kill and task timeout failure conditions, and whether or not each deserves a stack dump.
* If a SIGQUIT is issued before a SIGKILL, a SIGTERM is not sent.
* Refactored common code in ProcessTree.

> When tasks fail to report status, show tasks's stack dump before killing
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1119
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow gather a stack dump for the task. This could be done either by sending a SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to gather the stack directly from Java. This may be somewhat tricky since the child may be running as another user (so the SIGQUIT would have to go through LinuxTaskController).
> This feature would make debugging these kinds of failures much easier, especially if we could somehow get it into the TaskDiagnostic message.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
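For readers skimming the thread, the signal ordering described in the patch notes can be sketched roughly as below. This is only an illustration, not the code in MAPREDUCE-1119.3.patch: the class {{KillSignalPlan}} and the method {{signalsFor}} are hypothetical names, and the assumption that a non-failure kill sends SIGTERM followed by SIGKILL is inferred from the note that SIGTERM is skipped when a SIGQUIT is issued.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical sketch of the signal ordering described in the patch notes.
 * This is NOT the actual ProcessTree or TaskTracker code.
 */
public class KillSignalPlan {

    /**
     * Returns the signals to send to a task JVM, in order.
     *
     * @param wasFailure true when the task is killed because it failed
     *                   (for example, it timed out without reporting status)
     */
    public static List<String> signalsFor(boolean wasFailure) {
        if (wasFailure) {
            // SIGQUIT asks the JVM to print a thread dump to stdout.
            // Per the patch notes, SIGTERM is skipped in this case.
            return Arrays.asList("SIGQUIT", "SIGKILL");
        }
        // Assumed ordinary kill path: polite SIGTERM, then SIGKILL.
        return Arrays.asList("SIGTERM", "SIGKILL");
    }

    public static void main(String[] args) {
        System.out.println("task timeout: " + signalsFor(true));
        System.out.println("job kill:     " + signalsFor(false));
    }
}
```

In the real patch the {{wasFailure}} bit is threaded through the existing task-kill call chain rather than computed in one place; the sketch collapses that into a single decision so the ordering is easy to see.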