hadoop-common-dev mailing list archives

From "Vivek Ratan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1431) Map tasks can't timeout for failing to call progress
Date Tue, 29 May 2007 12:57:15 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499786 ]

Vivek Ratan commented on HADOOP-1431:
-------------------------------------

As part of a good solution (for 0.14 or later), I think we should separate the reporting of
progress by the sort/merge/user code from the reporting of progress from the Task to the TaskTracker.


For the former, we make the Reporter object available to the MapReduce kernel code, as Devaraj
suggested, and at other appropriate places as discussed in this conversation. Wherever progress
is made that we need to report (during sort or merge or whatever), the kernel code or the
user's code calls the Reporter object.

Separately, for the latter, we probably should continue with the Progress thread. This thread
looks at the Progress data structures and sends progress info to the TaskTracker via RPC.
To avoid the problem that this bug was filed for, we have two likely options: 
1. The thread continues doing what it does now: it sends the progress information at regular
intervals and the TaskTracker decides whether the task has really made progress, based on
what it got earlier. Or
2. The thread decides whether progress has really been made and makes an RPC call only if
necessary. Even if progress is not made, it may make a call if we eliminate the Ping thread
(see issue 1201) to prevent the TaskTracker from killing the task. 
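
To make option 1 concrete, here is a minimal Java sketch of the bookkeeping the TaskTracker
might do. The class name TaskProgressWatch and its methods are hypothetical, not existing
Hadoop code, and comparing a single progress fraction is an assumption for illustration only.

```java
// Hypothetical TaskTracker-side bookkeeping for option 1 (not actual Hadoop code).
// The task keeps sending status at regular intervals; the tracker decides
// whether anything has really changed since the previous message.
class TaskProgressWatch {
    private final long timeoutMillis;
    private float lastProgress = -1f;   // no status seen yet
    private long lastChangeMillis;      // last time the reported value changed

    TaskProgressWatch(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastChangeMillis = startMillis;
    }

    // Called for every periodic status message received from the task.
    void onStatus(float progress, long nowMillis) {
        if (progress != lastProgress) {
            lastProgress = progress;
            lastChangeMillis = nowMillis;
        }
    }

    // True once the reported value has not changed for longer than the timeout.
    boolean timedOut(long nowMillis) {
        return nowMillis - lastChangeMillis > timeoutMillis;
    }
}
```

Note that a single fraction would not catch the case in the issue description, where a map
sits at 100% while the sort is still running, so the status message would probably need to
carry more than that.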

Option 2 is probably the better one, as the logic to decide whether progress has been made
may be easier to implement in the thread than in the TaskTracker. As discussed earlier
in this conversation, we may resume/suspend the thread, or at least make sure we start and
stop it at the right places. But I'd suggest we separate the issue of reporting progress locally
(via the Reporter object) from reporting progress to the TaskTracker (via a thread). The logic
for the two issues is different, and separating the code will make things cleaner and easier
to change.
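
A rough Java sketch of that separation, assuming option 2. The names TrackerLink, LocalReporter
and ProgressSender are made up for illustration and are not the existing Reporter or RPC
interfaces. tick() stands for one pass of the progress thread's loop.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the Task -> TaskTracker RPC (not Hadoop's interface).
interface TrackerLink {
    void progress(long units);  // real progress was made since the last call
    void ping();                // no progress, but the task process is alive
}

// Local reporting: sort, merge, or user code calls progress() as work completes.
class LocalReporter {
    private final AtomicLong units = new AtomicLong();

    void progress() { units.incrementAndGet(); }

    long snapshot() { return units.get(); }
}

// Remote reporting: compares the local counter with what was last sent and
// makes a progress call only if it has moved.
class ProgressSender {
    private final LocalReporter reporter;
    private final TrackerLink link;
    private long lastSent = 0;

    ProgressSender(LocalReporter reporter, TrackerLink link) {
        this.reporter = reporter;
        this.link = link;
    }

    // Returns true if progress was reported, false if only a ping was sent.
    boolean tick() {
        long now = reporter.snapshot();
        if (now != lastSent) {
            lastSent = now;
            link.progress(now);
            return true;
        }
        link.ping();
        return false;
    }
}
```

A real thread would call tick() at a regular interval and would be started and stopped at the
right places in the task. The ping branch is only needed if the separate Ping thread goes away.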

> Map tasks can't timeout for failing to call progress
> ----------------------------------------------------
>
>                 Key: HADOOP-1431
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1431
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0
>            Reporter: Owen O'Malley
>            Assignee: Arun C Murthy
>             Fix For: 0.13.0
>
>         Attachments: HADOOP-1431_1_20070525.patch
>
>
> Currently the map task runner creates a thread that calls progress every second to keep
> the system from killing the map if the sort takes too long. This is the wrong approach, because
> it will cause stuck tasks to not be killed. The right solution is to have the sort call progress
> as it actually makes progress. This is part of what is going on in HADOOP-1374. A map gets
> stuck at 100% progress, but not done.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

