Message-ID: <1973917847.1239335472980.JavaMail.jira@brutus>
Date: Thu, 9 Apr 2009 20:51:12 -0700 (PDT)
From: "Devaraj Das (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-2141) speculative execution start up condition based on completion time

    [ https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697724#action_12697724 ]

Devaraj Das commented on HADOOP-2141:
-------------------------------------

Andy, the current patch doesn't apply unless the fuzz factor is set to 3 - "patch -p0 -F 3 < HADOOP-2141-v6.patch".

There is an NPE in the heartbeat method, which you can reproduce by running the test TestMiniMRDFSSort - "ant -Dtestcase=TestMiniMRDFSSort test -Dtest.output=yes". The test never finishes, since the TTs continue to resend the heartbeat forever. The NPE comes from the isSlowTracker method.

Looking more closely at the isSlowTracker method, I think it requires some rework. It currently looks at the progress rates of only the running TIPs and attributes them to the TaskTrackers that are running them. (You do check for TaskStatus.State.SUCCEEDED, but that check is always false, because only RUNNING TIPs are passed to the method.) But wouldn't you want to look at the history, i.e., the successful TIPs that ran on the TaskTrackers? I am thinking that it would make sense to give a TT one credit each time it runs a task successfully, and to base isSlowTracker purely on that (rather than on the running tasks). That way, even the TT's progress can be maintained inline, and you wouldn't have to iterate over the running TIPs and compute it on every TT heartbeat. Thoughts?

> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
>                 Key: HADOOP-2141
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2141
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.21.0
>            Reporter: Koji Noguchi
>            Assignee: Andy Konwinski
>         Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch, HADOOP-2141-v4.patch, HADOOP-2141-v5.patch, HADOOP-2141-v6.patch, HADOOP-2141.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck with 95% completion because of a bad disk.
> Devaraj pointed out
> bq. One of the conditions that must be met for launching a speculative instance of a task is that it must be at least 20% behind the average progress, and this is not true here.
> It would be nice if speculative execution also starts up when tasks stop making progress.
> Devaraj suggested
> bq. Maybe, we should introduce a condition for average completion time for tasks in the speculative execution check.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
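
The credit-based isSlowTracker suggestion in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the HADOOP-2141 patch and not the JobTracker API: the class TrackerCredits, its method names, and the "fewer credits than some fraction of the mean" rule are all hypothetical. The comment proposes only the one-credit-per-successful-task bookkeeping and does not say how the credits would be compared.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of per-TaskTracker success credits.
 * All names and the slowness rule are assumptions for illustration.
 */
final class TrackerCredits {
    // Number of successfully completed tasks credited to each tracker.
    private final Map<String, Integer> credits = new HashMap<>();
    private long totalCredits = 0;

    /** Record a tracker with zero credits the first time it is seen. */
    void registerTracker(String tracker) {
        credits.putIfAbsent(tracker, 0);
    }

    /** Give the tracker one credit when a task finishes successfully on it. */
    void taskSucceeded(String tracker) {
        credits.merge(tracker, 1, Integer::sum);
        totalCredits++;
    }

    /**
     * Assumed rule: a tracker is slow if it has fewer credits than
     * `fraction` times the mean credit count over all known trackers.
     * Unknown trackers, or a cluster with no history yet, are never
     * reported as slow, so there is no null dereference on a heartbeat.
     */
    boolean isSlowTracker(String tracker, double fraction) {
        Integer own = credits.get(tracker);
        if (own == null || totalCredits == 0) {
            return false;
        }
        double mean = (double) totalCredits / credits.size();
        return own < fraction * mean;
    }
}
```

Because the counts are updated when a task completes, a heartbeat only needs a map lookup and one division rather than a pass over the running TIPs. A raw count does not account for how long a tracker has been part of the cluster, so a real version would probably need to normalise for that.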