Message-ID: <23137139.1191234710635.JavaMail.jira@brutus>
Date: Mon, 1 Oct 2007 03:31:50 -0700 (PDT)
From: "Vivek Ratan (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Subject: [jira] Updated: (HADOOP-1970) tasktracker hang in reduce. Deadlock between main and comm thread
In-Reply-To: <29258945.1191049673461.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

     [ https://issues.apache.org/jira/browse/HADOOP-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Ratan updated HADOOP-1970:
--------------------------------

    Attachment: 1970_patch02

Since this is a fix for 0.14.2 and I shouldn't be adding any new functionality, 1970_patch02 fixes only the deadlock. In a separate issue (HADOOP-1974), I have detailed the larger solution, including caching the root node.

> tasktracker hang in reduce. Deadlock between main and comm thread
> -----------------------------------------------------------------
>
>                 Key: HADOOP-1970
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1970
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.1
>            Reporter: Koji Noguchi
>            Assignee: Vivek Ratan
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: 1970_patch01, 1970_patch02
>
>
> Saw one reduce task stuck on copy.
> jstack on the reduce task (task_200709272248_0001_r_000150_0) process showed
> {noformat}
> Found one Java-level deadlock:
> =============================
> "Comm thread for task_200709272248_0001_r_000150_0":
>   waiting to lock monitor 0x08144020 (object 0xd4e30aa8, a org.apache.hadoop.util.Progress),
>   which is held by "main"
> "main":
>   waiting to lock monitor 0x08144084 (object 0xd4e30958, a org.apache.hadoop.util.Progress),
>   which is held by "Comm thread for task_200709272248_0001_r_000150_0"
> Java stack information for the threads listed above:
> ===================================================
> "Comm thread for task_200709272248_0001_r_000150_0":
>         at org.apache.hadoop.util.Progress.toString(Progress.java:113)
>         - waiting to lock <0xd4e30aa8> (a org.apache.hadoop.util.Progress)
>         at org.apache.hadoop.util.Progress.toString(Progress.java:116)
>         - locked <0xd4e30958> (a org.apache.hadoop.util.Progress)
>         at org.apache.hadoop.util.Progress.toString(Progress.java:108)
>         at org.apache.hadoop.mapred.Task$1.run(Task.java:268)
>         at java.lang.Thread.run(Thread.java:619)
> "main":
>         at org.apache.hadoop.util.Progress.startNextPhase(Progress.java:58)
>         - waiting to lock <0xd4e30958> (a org.apache.hadoop.util.Progress)
>         at org.apache.hadoop.util.Progress.complete(Progress.java:70)
>         - locked <0xd4e30aa8> (a org.apache.hadoop.util.Progress)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:253)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1777)
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
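
The lock ordering in the trace above can be illustrated with a small sketch. The Comm thread holds one Progress monitor (0xd4e30958) inside toString() and waits for a second one (0xd4e30aa8), while main holds that second monitor inside complete() and waits for the first inside startNextPhase(). The Java below is only an illustration under assumptions: PhaseNode, its fields, and the parent/child reading of the two monitors are hypothetical stand-ins, not Hadoop's Progress class. Releasing the node's own lock before calling into the parent is one common way to break such a cycle, and is not necessarily what 1970_patch02 does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tree of progress nodes; NOT Hadoop's Progress class.
class PhaseNode {
    private final String name;
    private final PhaseNode parent;               // null for the root
    private final List<PhaseNode> children = new ArrayList<>();
    private int currentPhase = 0;
    private boolean done = false;

    PhaseNode(String name, PhaseNode parent) {
        this.name = name;
        this.parent = parent;
    }

    // Adds a child phase under this node's lock.
    synchronized PhaseNode addPhase(String childName) {
        PhaseNode child = new PhaseNode(childName, this);
        children.add(child);
        return child;
    }

    // Takes only this node's lock.
    synchronized void startNextPhase() {
        currentPhase++;
    }

    // The deadlock-prone shape would be a fully synchronized complete() that
    // calls parent.startNextPhase() while still holding this node's lock
    // (child, then parent). Here the child's lock is released first.
    void complete() {
        synchronized (this) {
            done = true;
        }
        if (parent != null) {
            parent.startNextPhase();   // acquired without holding our own lock
        }
    }

    // Recurses from parent to child while holding the parent's lock,
    // which is the order the Comm thread's stack shows.
    @Override
    public synchronized String toString() {
        StringBuilder sb = new StringBuilder(name);
        if (done) {
            sb.append(" (done)");
        }
        if (currentPhase < children.size()) {
            sb.append(" > ").append(children.get(currentPhase).toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        PhaseNode root = new PhaseNode("reduce", null);
        PhaseNode copy = root.addPhase("copy");
        PhaseNode sort = root.addPhase("sort");

        // Plays the role of the comm thread: repeatedly renders the tree.
        Thread comm = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) {
                root.toString();
            }
        }, "comm");
        comm.start();

        // Plays the role of main: completes phases while the tree is being read.
        copy.complete();
        sort.complete();
        comm.join();
        System.out.println(root);
    }
}
```

With this ordering, no thread ever holds a child's monitor while waiting for its parent's, so the only nesting left is parent then child in toString(). The trade-off is that complete() is no longer atomic across the two nodes, which may or may not be acceptable for the real class.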