Message-ID: <25287604.53061277445110039.JavaMail.jira@thor>
Date: Fri, 25 Jun 2010 01:51:50 -0400 (EDT)
From: "Amareshwari Sriramadasu (JIRA)"
To: mapreduce-dev@hadoop.apache.org
Subject: [jira] Created: (MAPREDUCE-1895) MapEventFetcherThread should not iterate over jobs that are not localized

MapEventFetcherThread should not iterate over jobs that are not localized
-------------------------------------------------------------------------

                 Key: MAPREDUCE-1895
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1895
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: tasktracker
            Reporter: Amareshwari Sriramadasu

We have seen trackers being lost on our clusters because of the following chain of lock waits:

1. TaskLauncher has locked a TaskTracker$RunningJob and is running localizeJob, which involves DFS operations.
2. The map-event fetcher has locked the TaskTracker.runningJobs map and is waiting to lock that RunningJob object.
3. TaskTracker offerService is waiting to lock the TaskTracker.runningJobs map, and so fails to send heartbeats for 10 minutes.

So I think the map-event fetcher should skip (short-circuit) jobs that are not localized.
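To illustrate the proposal, here is a minimal Java sketch of a fetcher scan that skips jobs that are not yet localized. It is not the TaskTracker code: the class LocalizedJobFilterSketch, the simplified RunningJob, the volatile localized flag, and the methods jobsToFetch and fetchWhileLocalizing are all hypothetical names made up for this example. It assumes the localized state can be read without taking the job's lock.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-ins; these are not the real TaskTracker classes.
class LocalizedJobFilterSketch {

    /** Simplified stand-in for TaskTracker$RunningJob. */
    static class RunningJob {
        // volatile so the fetcher can read it without locking the job
        volatile boolean localized;
        RunningJob(boolean localized) { this.localized = localized; }
    }

    /**
     * Returns the IDs of jobs the fetcher should poll. It holds the map
     * lock while scanning but never locks a job that is not localized.
     */
    static List<String> jobsToFetch(Map<String, RunningJob> runningJobs) {
        List<String> result = new ArrayList<>();
        synchronized (runningJobs) {
            for (Map.Entry<String, RunningJob> e : runningJobs.entrySet()) {
                RunningJob rjob = e.getValue();
                if (!rjob.localized) {
                    continue; // skip: its lock may be held by localization
                }
                synchronized (rjob) {
                    result.add(e.getKey());
                }
            }
        }
        return result;
    }

    /** Simulates a launcher holding one job's lock while the fetcher scans. */
    static List<String> fetchWhileLocalizing() {
        Map<String, RunningJob> jobs = new LinkedHashMap<>();
        RunningJob localizing = new RunningJob(false);
        jobs.put("job_1", localizing);
        jobs.put("job_2", new RunningJob(true));

        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread launcher = new Thread(() -> {
            synchronized (localizing) {      // like TaskLauncher in localizeJob
                lockHeld.countDown();
                try {
                    release.await();         // stands in for slow DFS work
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        launcher.start();
        try {
            lockHeld.await();
            // The scan would block here if it tried to lock job_1.
            return jobsToFetch(jobs);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ie);
        } finally {
            release.countDown();             // let the launcher thread finish
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchWhileLocalizing()); // prints [job_2]
    }
}
```

In this sketch the scan returns while job_1's lock is still held, so the map lock is released promptly and a heartbeat thread waiting on the map would not be starved. Whether an unlocked flag check is safe in the real TaskTracker depends on how localization state is published there, which this example does not cover.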