Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Reply-To: mapreduce-issues@hadoop.apache.org
Message-ID: <1319597136.1259743340807.JavaMail.jira@brutus>
Date: Wed, 2 Dec 2009 08:42:20 +0000 (UTC)
From: "Vinod K V (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing
In-Reply-To: <44290688.1255733731270.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784707#action_12784707 ]

Vinod K V commented on MAPREDUCE-1119:
--------------------------------------

bq. I am not very sure we want to use the same configuration property for sleeping after dump-stack

bq. So I think the delay is necessary. As for adding a new config item.. maybe wait til someone makes an issue of this? Is there any performance implication here?

Given that the delay is necessary, the only open question is the config item: a new one or the same one. Thinking about it a bit, I am fine with keeping the same configuration property, because it gives the thread dump the same amount of time that the process is given to clean itself up. I know it isn't a very concrete reason, but it works for me for now :)

I have gone through your latest patch. It looks good now. +1 from my side.

I'll run it through Hudson once again to be sure about the test cases. Maybe the {{TestGridmixSubmission}} and {{TestJobHistory.testDoneFolderOnHDFS()}} failures are timing related.

After Hudson blesses it, can you ask someone to commit this? Thanks for being patient throughout!

> When tasks fail to report status, show tasks's stack dump before killing
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1119
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.6.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow gather a stack dump for the task.
> This could be done either by sending a SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to gather the stack directly from Java.
> This may be somewhat tricky since the child may be running as another user (so the SIGQUIT would have to go through LinuxTaskController).
> This feature would make debugging these kinds of failures much easier, especially if we could somehow get it into the TaskDiagnostic message.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
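
For readers following the thread, the SIGQUIT option from the issue description, combined with the delay discussed in the comment, might be sketched roughly as below. This is only an illustration and not the code in the attached patches: the class DumpThenKill, its method names, and the delayMillis parameter are hypothetical, and the sketch assumes a POSIX kill utility on the PATH and a child JVM owned by the same user (the case where the child runs as another user, which the description says would have to go through LinuxTaskController, is not handled).

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of Hadoop or the MAPREDUCE-1119 patch. */
class DumpThenKill {

    /** Builds the argument vector for the POSIX kill utility, e.g. [kill, -QUIT, 1234]. */
    static List<String> killCommand(long pid, String signal) {
        return Arrays.asList("kill", "-" + signal, Long.toString(pid));
    }

    /** Runs kill for the given pid and reports whether it exited with status 0. */
    static boolean sendSignal(long pid, String signal)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(killCommand(pid, signal))
                .inheritIO()
                .start();
        return p.waitFor() == 0;
    }

    /** Requests a thread dump, waits, then kills the process. */
    static void dumpThenKill(long pid, long delayMillis)
            throws IOException, InterruptedException {
        // SIGQUIT makes a HotSpot JVM print all thread stacks to its stdout
        // and keep running, so the dump lands in the task's stdout.
        sendSignal(pid, "QUIT");
        // Give the JVM time to finish writing the dump before it is killed.
        Thread.sleep(delayMillis);
        // The return value is ignored: the process may already have exited.
        sendSignal(pid, "KILL");
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: DumpThenKill <pid> <delayMillis>");
            return;
        }
        dumpThenKill(Long.parseLong(args[0]), Long.parseLong(args[1]));
    }
}
```

The pause between the two signals is the delay debated above: without it, the SIGKILL could arrive before the JVM has finished writing the dump.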