Message-ID: <1579847708.1228021184233.JavaMail.jira@brutus>
Date: Sat, 29 Nov 2008 20:59:44 -0800 (PST)
From: "Chris Douglas (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-4714) map tasks timing out during merge phase
In-Reply-To: <519168781.1227508784154.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651787#action_12651787 ]

Chris Douglas commented on HADOOP-4714:
---------------------------------------

bq. Although this would solve the issue for our particular case, I can imagine a situation (e.g. single reducer with highly aggregated huge records) where this would not help, i.e. the time component needs to be factored into the progress reporting. Progress should always be reported at smaller intervals than the timeout which is configurable and could be a small number.

Fair point. Still, it's far from the only profile that makes unobserved progress and the current approach isn't guaranteed to work in this hypothetical case, either. Balancing performance against accuracy in the current model admits the possibility of spurious timeouts by definition. The heuristics we use - every N records out of the merge, every partition, etc. - are, as you point out, not guaranteed to fit each job's profile, but they're usually good enough without being too expensive. Continuing to refine them is a stop-gap, admittedly.

The TaskTracker could try to look for external signs of progress in tasks it suspects are stuck (e.g. spills/merges generated between heartbeats), but that mistakes MapTask side-effects for task health. Adding a thread to poll for progress within the task adds modest benefits, but the cost in complexity (and likely performance) is discouraging. It also separates the checks for progress from the code effecting it, making it harder to maintain. While it's possible to imagine an adaptive system that would sample the frequency of status updates generated from each component and tune each threshold to the particular job, that's a long, long way from what is currently in place.

For the scope of this JIRA, I think it's sufficient to flag progress after each partition. Doing more reporting out of the merge would be OK, but I see no reason to enshrine the N records/progress update heuristic.
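
To make the per-partition proposal concrete, here is a minimal sketch of a merge loop that flags progress once after every partition rather than every N records. It is not the attached patch: the class name, the ProgressCallback interface (a hypothetical stand-in for the task's progress hook), and the in-memory "merge" are all assumptions made for illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class PartitionProgressSketch {

    /** Hypothetical stand-in for the task's progress hook; not a Hadoop type. */
    interface ProgressCallback {
        void progress();
    }

    /**
     * "Merges" each partition (here it only counts records) and flags
     * progress once per partition, independent of the record count.
     */
    static long mergePartitions(List<List<String>> partitions, ProgressCallback callback) {
        long merged = 0;
        for (List<String> partition : partitions) {
            for (String record : partition) {
                merged++; // placeholder for the real per-record merge work
            }
            callback.progress(); // one report per partition, even an empty one
        }
        return merged;
    }

    public static void main(String[] args) {
        AtomicInteger reports = new AtomicInteger();
        List<List<String>> partitions =
            List.of(List.of("a", "b"), List.<String>of(), List.of("c"));
        long merged = mergePartitions(partitions, reports::incrementAndGet);
        // Three partitions, so three progress reports for three records.
        System.out.println(merged + " records, " + reports.get() + " progress reports");
    }
}
```

As the quoted objection notes, a single very large partition would still go unreported for the whole inner loop under this scheme; covering that case would need a record-count or time check inside the loop as well.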
> map tasks timing out during merge phase
> ---------------------------------------
>
>                 Key: HADOOP-4714
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4714
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.18.1
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>         Attachments: hadoop-4714.patch
>
>
> With compression of transient data turned on some parts of the merge phase seem to not report progress enough.
> We see a lot of task failures during the merge phase, most of them timing out (even with a 20 min timeout)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.