Message-ID: <312270172.1242194205651.JavaMail.jira@brutus>
Date: Tue, 12 May 2009 22:56:45 -0700 (PDT)
From: "dhruba borthakur (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Resolved: (HADOOP-5182) Task processes not exiting due to ackQueue bug in DFSClient
In-Reply-To: <959024065.1233887279613.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/HADOOP-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur resolved HADOOP-5182.
--------------------------------------

    Resolution: Duplicate

I am closing this one because I think this is a duplicate of HADOOP-3998. Please re-open if you think otherwise.

> Task processes not exiting due to ackQueue bug in DFSClient
> -----------------------------------------------------------
>
>                 Key: HADOOP-5182
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5182
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.20.0, 0.21.0
>         Environment: EC2 with Ubuntu Linux 2.6.16 AMI. Hadoop trunk revision 734915
>            Reporter: Andy Konwinski
>         Attachments: jstack_ackqueue_bug
>
>
> I was running some gridmix tests on a 10 node cluster on EC2 and ran into an issue with an unmodified Hadoop trunk (SVN revision 734915). After running gridmix multiple times, I noticed several mapreduce jobs stuck in the running state. They have remained in that hung state for several days and are still hung, while other gridmix runs of that size finished in approximately 8 hours.
> I saw that the slave nodes had a bunch of hung task processes running, for tasks that the JobTracker log said were completed. These were hanging because the SIGTERM handler was waiting on DFSClient to close existing streams, but the close never finished because DFSOutputStream waits on an ackQueue from the datanodes that was apparently getting no acks. The tasks did finish their work, but the processes hung around.
> I'll attach a sample jstack trace. Note how the SIGTERM handler is blocked on "thread-5", which is waiting for a monitor on the DFSOutputStream, but this stream's monitor is held by main, which is trying to flush the stream (last trace in the file).
> Has anyone seen this issue before?
> Another thing I saw was cleanup tasks never running (they are stuck in the initializing state on the web UI and can't be seen as running processes on the nodes). Not sure if that is actually related.

-- 
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
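
For readers following the report, the hang described in the quoted text can be modelled with a small Java sketch. This is not the DFSClient code: FakeStream, its ackQueue, and the method names are hypothetical stand-ins, and the sketch assumes that the flush path holds the stream's monitor while waiting on a separate ack queue, as the jstack description suggests. One thread plays the role of main inside flush, and a second thread plays the role of the shutdown-time closer.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;

public class AckQueueHangSketch {

    /** Hypothetical stand-in for DFSOutputStream; not the real Hadoop class. */
    static class FakeStream {
        private final Queue<Integer> ackQueue = new ArrayDeque<>();
        private final CountDownLatch flushStarted = new CountDownLatch(1);

        FakeStream() {
            ackQueue.add(1); // one packet sent, its ack is still outstanding
        }

        // Holds the stream monitor for the whole wait. ackQueue.wait()
        // releases only the ackQueue monitor, not the stream monitor.
        synchronized void flush() throws InterruptedException {
            synchronized (ackQueue) {
                flushStarted.countDown();
                while (!ackQueue.isEmpty()) {
                    ackQueue.wait();
                }
            }
        }

        // Needs the stream monitor, so it cannot start while flush() waits.
        synchronized void close() { }

        // Simulates the datanode acks finally arriving.
        void deliverAcks() {
            synchronized (ackQueue) {
                ackQueue.clear();
                ackQueue.notifyAll();
            }
        }
    }

    /** Returns true if the closer thread was seen BLOCKED while flush() waited. */
    static boolean closerBlocksWhileFlushWaits(long timeoutMillis) {
        FakeStream stream = new FakeStream();
        Thread flusher = new Thread(() -> {
            try {
                stream.flush();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "main-like-flusher");
        Thread closer = new Thread(stream::close, "shutdown-like-closer");
        boolean blocked = false;
        try {
            flusher.start();
            stream.flushStarted.await(); // flusher now owns the stream monitor
            closer.start();
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                if (closer.getState() == Thread.State.BLOCKED) {
                    blocked = true;
                    break;
                }
                Thread.sleep(10);
            }
            // Unlike the reported hang, the sketch delivers the acks so it terminates.
            stream.deliverAcks();
            flusher.join();
            closer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("closer blocked while flush waited: "
                + closerBlocksWhileFlushWaits(2000));
    }
}
```

In the reported case the acks apparently never arrive, so the closer would stay blocked indefinitely and the process could not exit. The sketch releases the wait only so that it can finish.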