Date: Wed, 17 Oct 2012 16:14:03 +0000 (UTC)
From: "Nathan Roberts (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Message-ID: <332266294.58354.1350490443977.JavaMail.jiratomcat@arcas>
Subject: [jira] [Created] (MAPREDUCE-4728) Interaction between oob heartbeats and damper can cause TT to heartbeat with zero delay

Nathan Roberts created MAPREDUCE-4728:
-----------------------------------------

             Summary: Interaction between oob heartbeats and damper can cause TT to heartbeat with zero delay
                 Key: MAPREDUCE-4728
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4728
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 1.0.3
            Reporter: Nathan Roberts

When mapreduce.tasktracker.outofband.heartbeat is true and mapreduce.tasktracker.outofband.heartbeat.damper is something largish (like the default of 1000000), the TT doesn't wait for tasks
to finish before heartbeating back to the JT. This causes excessive load on the JT, which in turn reduces overall cluster performance.

I believe the problem is that in the following block of code, when getHeartbeatInterval() returns 0, we heartbeat back immediately BUT finishedCount does not get reset. It looks like nothing ever gets us out of this situation, so we basically heartbeat without ever sleeping.

{code}
        // accelerate to account for multiple finished tasks up-front
        long remaining =
          (lastHeartbeat + getHeartbeatInterval(finishedCount.get())) - now;
        while (remaining > 0) {
          // sleeps for the wait time or
          // until there are *enough* empty slots to schedule tasks
          synchronized (finishedCount) {
            finishedCount.wait(remaining);

            // Recompute
            now = System.currentTimeMillis();
            remaining =
              (lastHeartbeat + getHeartbeatInterval(finishedCount.get())) - now;

            if (remaining <= 0) {
              // Reset count
              finishedCount.set(0);
              break;
            }
          }
        }
{code}
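
To make the failure mode concrete, here is a small standalone simulation of the wait logic quoted above. It is only a sketch, not TaskTracker code: the class name, the 3000 ms base interval, the formula inside getHeartbeatInterval() and the resetOnZeroWait flag are all assumptions made for illustration, and time is simulated rather than slept. The reset variant is one possible remedy, not necessarily the fix the project would adopt.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of the reported wait logic; time is simulated, nothing sleeps. */
class HeartbeatDamperSketch {
    // Assumed values: a 3 s base interval and the damper default quoted in the report.
    static final long BASE_INTERVAL_MS = 3000L;
    static final long DAMPER = 1_000_000L;

    // Assumed formula: the interval shrinks as finished tasks accumulate.
    // With a large damper, a single finished task already yields 0.
    static long getHeartbeatInterval(int finished) {
        return BASE_INTERVAL_MS / (finished * DAMPER + 1);
    }

    /**
     * Returns how long the TT would wait before its next heartbeat.
     * resetOnZeroWait=false mirrors the reported code path;
     * resetOnZeroWait=true is a hypothetical fix.
     */
    static long waitBeforeHeartbeat(AtomicInteger finishedCount, long lastHeartbeat,
                                    long now, boolean resetOnZeroWait) {
        long remaining =
            (lastHeartbeat + getHeartbeatInterval(finishedCount.get())) - now;
        if (remaining <= 0) {
            // The while loop from the report is never entered,
            // so its "Reset count" branch never runs.
            if (resetOnZeroWait) {
                finishedCount.set(0);
            }
            return 0;
        }
        // Models a full wait during which no further task finishes: after the
        // wait the recomputed remaining is 0, so the loop resets the count.
        finishedCount.set(0);
        return remaining;
    }

    /** Counts heartbeats sent with zero delay after one task has finished. */
    static int countZeroDelayHeartbeats(int rounds, boolean resetOnZeroWait) {
        AtomicInteger finishedCount = new AtomicInteger(1); // one finished task
        long now = 0;
        long lastHeartbeat = 0;
        int zeroDelay = 0;
        for (int i = 0; i < rounds; i++) {
            long wait = waitBeforeHeartbeat(finishedCount, lastHeartbeat, now,
                                            resetOnZeroWait);
            if (wait == 0) {
                zeroDelay++;
            }
            now += wait;          // advance the simulated clock
            lastHeartbeat = now;  // heartbeat sent
        }
        return zeroDelay;
    }

    public static void main(String[] args) {
        System.out.println("as reported: "
            + countZeroDelayHeartbeats(5, false) + " of 5 heartbeats had zero delay");
        System.out.println("with reset:  "
            + countZeroDelayHeartbeats(5, true) + " of 5 heartbeats had zero delay");
    }
}
```

Under these assumed values, the path without a reset sends every heartbeat with zero delay once a single task has finished, while the variant that clears finishedCount sends one immediate heartbeat and then returns to the full interval.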