Message-ID: <21457799.63821283150937989.JavaMail.jira@thor>
Date: Mon, 30 Aug 2010 02:48:57 -0400 (EDT)
From: "Liyin Liang (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker
In-Reply-To: <1048862769.1259565323659.JavaMail.jira@brutus>

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904091#action_12904091 ]

Liyin Liang commented on MAPREDUCE-1247:
----------------------------------------

Hi Guanyin, our production cluster hit the same problem. Would you please attach your patch file? Thanks.

> Send out-of-band heartbeat to avoid fake lost tasktracker
> ---------------------------------------------------------
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Reporter: ZhuGuanyin
> Assignee: ZhuGuanyin
>
> Currently the TaskTracker reports task status to the JobTracker through heartbeats. Sometimes, when a thread holds the TaskTracker lock to do cleanup work, such as removing task temp data from disk, the heartbeat thread hangs for a long time waiting for that lock. The JobTracker then concludes that the TaskTracker is lost and reschedules all of its finished maps or unfinished reduces on other TaskTrackers. We call this a "fake lost tasktracker". This is sometimes unacceptable, especially when we run large jobs. So we introduce an out-of-band heartbeat mechanism that sends an out-of-band heartbeat in that case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
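
For readers following the issue description above, here is a minimal sketch of the idea. It is not the patch requested in this thread, which is not attached here, and it is not Hadoop's TaskTracker code. The class `HeartbeatSketch`, the `Kind` enum, and the choice of a timed `tryLock` to detect a busy tracker are all assumptions made for illustration. The heartbeat path waits a bounded time for the tracker lock; if cleanup still holds it, a liveness-only (out-of-band) heartbeat is chosen instead of blocking.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; none of these names come from Hadoop.
public class HeartbeatSketch {
    enum Kind { FULL, OUT_OF_BAND }

    // Stands in for the tracker-wide lock that cleanup work holds.
    private final ReentrantLock trackerLock = new ReentrantLock();

    // Wait a bounded time for the lock, then fall back to a liveness-only heartbeat.
    Kind nextHeartbeat(long maxWaitMillis) {
        boolean locked = false;
        try {
            locked = trackerLock.tryLock(maxWaitMillis, TimeUnit.MILLISECONDS);
            // With the lock, task state can be read and reported in full.
            // Without it, only liveness is reported so the tracker is not declared lost.
            return locked ? Kind.FULL : Kind.OUT_OF_BAND;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Kind.OUT_OF_BAND;
        } finally {
            if (locked) {
                trackerLock.unlock();
            }
        }
    }

    // Simulates slow cleanup: holds the lock until 'release' is counted down.
    Thread startCleanup(CountDownLatch holding, CountDownLatch release) {
        Thread t = new Thread(() -> {
            trackerLock.lock();
            try {
                holding.countDown();   // signal that the lock is now held
                release.await();       // pretend to delete temp data for a while
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                trackerLock.unlock();
            }
        });
        t.start();
        return t;
    }

    // Scenario: idle tracker, tracker busy with cleanup, tracker idle again.
    static Kind[] demo() {
        try {
            HeartbeatSketch tt = new HeartbeatSketch();
            Kind idle = tt.nextHeartbeat(10);
            CountDownLatch holding = new CountDownLatch(1);
            CountDownLatch release = new CountDownLatch(1);
            Thread cleanup = tt.startCleanup(holding, release);
            holding.await();                   // cleanup thread owns the lock
            Kind busy = tt.nextHeartbeat(10);  // times out, so falls back
            release.countDown();
            cleanup.join();                    // lock has been released
            Kind after = tt.nextHeartbeat(10);
            return new Kind[] { idle, busy, after };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (Kind k : demo()) {
            System.out.println(k);
        }
    }
}
```

Running `main` prints FULL, OUT_OF_BAND, FULL. The bounded wait is the design point: the heartbeat thread never blocks longer than `maxWaitMillis`, so a long cleanup delays detailed status but not the liveness signal.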