Date: Tue, 1 Apr 2014 20:52:31 +0000 (UTC)
From: "Sangjin Lee (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-5817) mappers get rescheduled on node transition even after all reducers are completed

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956998#comment-13956998 ]

Sangjin Lee commented on MAPREDUCE-5817:
----------------------------------------

The test failures are unrelated to this patch. They are coming from MAPREDUCE-5815.
> mappers get rescheduled on node transition even after all reducers are completed
> --------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5817
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5817
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 2.3.0
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: mapreduce-5817.patch
>
>
> We're seeing a behavior where a job keeps running long after all of its reducers have finished. We found that the job was rescheduling and running a number of mappers beyond the point of reducer completion. In one situation, the job ran for some 9 more hours after all reducers completed!
> This happens because whenever a node transition (to an unusable state) comes into the app master, it unconditionally reschedules all mappers that already ran on that node.
> Therefore, any node transition has the potential to extend the job's running time. Once this window opens, another node transition can prolong it, and in theory this can happen indefinitely.
> If there is some instability in the pool (unhealthy nodes, etc.) for a duration, then any big job is severely vulnerable to this problem.
> If all reducers have been completed, JobImpl.actOnUnusableNode() should not reschedule mapper tasks. At that point the mapper outputs are no longer needed, so rerunning the mappers would produce output that nothing consumes.
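The guard proposed in the description can be sketched roughly as follows. This is a simplified, self-contained model and not the code from mapreduce-5817.patch: `UnusableNodeSketch`, `CompletedAttempt`, and `mapsToReschedule` are hypothetical names made up for illustration, and only `JobImpl.actOnUnusableNode()` is named in the issue itself. Treating a job with zero reducers as "all reducers completed" is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the proposed check; these are NOT the Hadoop classes. */
final class UnusableNodeSketch {

    enum TaskType { MAP, REDUCE }

    /** Hypothetical record of a task attempt that succeeded on a given node. */
    static final class CompletedAttempt {
        final String taskId;
        final TaskType type;
        final String nodeId;

        CompletedAttempt(String taskId, TaskType type, String nodeId) {
            this.taskId = taskId;
            this.type = type;
            this.nodeId = nodeId;
        }
    }

    /**
     * Returns the IDs of map tasks to rerun after nodeId becomes unusable.
     * Hypothetical stand-in for the decision made in JobImpl.actOnUnusableNode().
     */
    static List<String> mapsToReschedule(String nodeId,
                                         List<CompletedAttempt> succeeded,
                                         int totalReducers,
                                         int completedReducers) {
        // Proposed guard: once every reducer has completed, map outputs are no
        // longer needed, so nothing is rescheduled. With zero reducers this is
        // trivially true (an assumption of this sketch).
        if (completedReducers == totalReducers) {
            return new ArrayList<>();
        }
        // Otherwise, rerun each map task whose successful attempt ran on the node.
        List<String> result = new ArrayList<>();
        for (CompletedAttempt attempt : succeeded) {
            if (attempt.type == TaskType.MAP && attempt.nodeId.equals(nodeId)) {
                result.add(attempt.taskId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<CompletedAttempt> succeeded = new ArrayList<>();
        succeeded.add(new CompletedAttempt("m1", TaskType.MAP, "nodeA"));
        succeeded.add(new CompletedAttempt("m2", TaskType.MAP, "nodeB"));
        succeeded.add(new CompletedAttempt("m3", TaskType.MAP, "nodeA"));
        succeeded.add(new CompletedAttempt("r1", TaskType.REDUCE, "nodeA"));

        // One of two reducers is still pending: maps from nodeA are rerun.
        System.out.println(mapsToReschedule("nodeA", succeeded, 2, 1));
        // Both reducers are done: the node transition triggers no map reruns.
        System.out.println(mapsToReschedule("nodeA", succeeded, 2, 2));
    }
}
```

With the guard in place, a node transition that arrives after the last reducer finishes no longer reopens the window described above, because the set of mappers to rerun is empty.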