Message-ID: <2029873.1180443435786.JavaMail.jira@brutus>
Date: Tue, 29 May 2007 05:57:15 -0700 (PDT)
From: "Vivek Ratan (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-1431) Map tasks can't timeout for failing to call progress
In-Reply-To: <13907972.1180066936133.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/HADOOP-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499786 ]

Vivek Ratan commented on HADOOP-1431:
-------------------------------------

As part
of a good solution (for 0.14 or later), I think we should separate the reporting of progress by the sort/merge/user code from the reporting of progress from the Task to the TaskTracker.

For the former, we make the Reporter object available to the MapReduce kernel code, as Devaraj suggested, and at other appropriate places as discussed in this conversation. Wherever progress is made that we need to report (during sort or merge or whatever), the kernel code or the user's code calls the Reporter object.

Separately, for the latter, we should probably continue with the Progress thread. This thread looks at the Progress data structures and sends progress info to the TaskTracker via RPC. To avoid the problem that this bug was filed for, we have two likely options:

1. The thread continues doing what it does now: it sends the progress information at regular intervals, and the TaskTracker decides whether the task has really made progress, based on what it got earlier. Or
2. The thread decides whether progress has really been made and makes an RPC call only if necessary. Even if no progress has been made, it may still need to make a call if we eliminate the Ping thread (see issue 1201), to prevent the TaskTracker from killing the task.

Option 2 is probably the better one, as the logic to decide whether progress has been made may be easier to implement in the thread than in the TaskTracker. As discussed earlier in this conversation, we may resume/suspend the thread, or at least make sure we start and stop it at the right places.

But I'd suggest we separate the issue of reporting progress locally (via the Reporter object) from reporting progress to the TaskTracker (via a thread). The logic for the two issues is different, and separating the code will make things cleaner and easier to change.
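
To make the split concrete, here is a rough sketch of option 2 in Java. It is not Hadoop code: LocalReporter, TrackerLink and ProgressSender are hypothetical names invented for illustration, and the RPC is replaced by a plain interface. The only point is that the local reporting path and the path to the TaskTracker share nothing but a counter.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ProgressSketch {
    /** Local side (hypothetical): kernel or user code calls progress() when real work is done. */
    static final class LocalReporter {
        private final AtomicLong ticks = new AtomicLong();
        void progress() { ticks.incrementAndGet(); }
        long ticks() { return ticks.get(); }
    }

    /** Stand-in for the RPC to the TaskTracker (assumption, not the real protocol). */
    interface TrackerLink {
        void reportProgress(long ticks);
    }

    /** Remote side (hypothetical): decides on each poll whether there is anything new to send. */
    static final class ProgressSender {
        private final LocalReporter reporter;
        private final TrackerLink link;
        private long lastSent = 0;

        ProgressSender(LocalReporter reporter, TrackerLink link) {
            this.reporter = reporter;
            this.link = link;
        }

        /** One polling step; returns true only if a call to the tracker was made. */
        boolean poll() {
            long now = reporter.ticks();
            if (now == lastSent) {
                return false; // nothing changed since the last report, so stay silent
            }
            link.reportProgress(now);
            lastSent = now;
            return true;
        }
    }

    public static void main(String[] args) {
        LocalReporter reporter = new LocalReporter();
        long[] calls = {0};
        ProgressSender sender = new ProgressSender(reporter, t -> calls[0]++);

        System.out.println(sender.poll()); // false: no progress reported yet
        reporter.progress();               // e.g. the sort finished one pass
        System.out.println(sender.poll()); // true: one call goes out
        System.out.println(sender.poll()); // false: a stuck task sends nothing
    }
}
```

In a real task, poll() would run in the Progress thread's loop at a fixed interval. If the Ping thread goes away, the silent branch would presumably have to send a keepalive that is marked as "no progress", so the TaskTracker can still tell a live-but-stuck task from a dead one.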

> Map tasks can't timeout for failing to call progress
> ----------------------------------------------------
>
>                 Key: HADOOP-1431
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1431
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>             Fix For: 0.13.0
>
>         Attachments: HADOOP-1431_1_20070525.patch
>
>
> Currently the map task runner creates a thread that calls progress every second to keep the system from killing the map if the sort takes too long. This is the wrong approach, because it will cause stuck tasks to not be killed. The right solution is to have the sort call progress as it actually makes progress. This is part of what is going on in HADOOP-1374. A map gets stuck at 100% progress, but not done.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.