hadoop-common-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1431) Map tasks can't timeout for failing to call progress
Date Fri, 25 May 2007 16:42:16 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499152 ]

Devaraj Das commented on HADOOP-1431:

The main requirement in this issue is to let the sort report progress.
From an architecture point of view, I think it makes sense to have at least the MapReduce
kernel part of the sort aware of that, i.e., the generic BufferSorter.
My major objection to this patch is that it short-circuits the layering, which makes the
whole thing look hacky IMO. I would much rather do it the following way:
1) Add a method to the BufferSorter interface called setReporter(Reporter).
2) Implementors of the interface, in this case BasicTypeSorterBase, would implement the
method and just store the reporter object. This is similar to BufferSorter.setInputBuffer.
3) BasicTypeSorterBase would periodically invoke reporter.progress() to report progress.
The compare method in the BasicTypeSorterBase class is a potential place where reporter.progress()
can be called.
This way, we don't make the sort library (currently the MergeSorter and MergeSort classes) aware
of the Reporter object, but keep everything in the MapReduce kernel. This preserves the boundaries
that I originally intended to have between the various layers (HADOOP-331).
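
To make steps 1-3 concrete, here is a rough sketch of the layering I have in mind. It is not
the patch and it does not use the real Hadoop classes: ProgressReporter, IntCompare,
PlainSortLibrary, SketchBufferSorter, SketchBasicSorter and the comparesPerReport knob are all
made-up names for illustration, and an insertion sort stands in for the sort library.

```java
// Hypothetical stand-in for Reporter; only the progress() call matters here.
interface ProgressReporter {
    void progress();
}

// Comparison callback handed to the sort library (illustrative name).
interface IntCompare {
    int compare(int a, int b);
}

// Stand-in for the sort library layer: it knows nothing about reporters.
final class PlainSortLibrary {
    private PlainSortLibrary() {}

    // Simple insertion sort; every comparison goes through the callback.
    static int[] sort(int[] keys, IntCompare cmp) {
        int[] out = keys.clone();
        for (int i = 1; i < out.length; i++) {
            int key = out[i];
            int j = i - 1;
            while (j >= 0 && cmp.compare(out[j], key) > 0) {
                out[j + 1] = out[j];
                j--;
            }
            out[j + 1] = key;
        }
        return out;
    }
}

// Step 1: the sorter interface gains a setter for the reporter.
interface SketchBufferSorter {
    void setReporter(ProgressReporter reporter);
    int[] sort(int[] keys);
}

// Steps 2 and 3: store the reporter, then report from compare().
class SketchBasicSorter implements SketchBufferSorter {
    private final int comparesPerReport; // assumed tuning knob, not from the patch
    private ProgressReporter reporter;
    private long compares;

    SketchBasicSorter(int comparesPerReport) {
        if (comparesPerReport <= 0) {
            throw new IllegalArgumentException("comparesPerReport must be positive");
        }
        this.comparesPerReport = comparesPerReport;
    }

    @Override
    public void setReporter(ProgressReporter reporter) {
        this.reporter = reporter;
    }

    // Called once per comparison, so progress is reported only while
    // the sort is actually doing work.
    int compare(int a, int b) {
        compares++;
        if (reporter != null && compares % comparesPerReport == 0) {
            reporter.progress();
        }
        return Integer.compare(a, b);
    }

    @Override
    public int[] sort(int[] keys) {
        // The library only sees a comparison callback, never the reporter.
        return PlainSortLibrary.sort(keys, this::compare);
    }
}
```

The point of the sketch is that only the kernel-side sorter holds the reporter; if the sort
stops making comparisons, the progress calls stop too, and the task can time out.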

For the ReduceTask, we have threads for reporting progress for two phases:
1) during the shuffle (and here we implicitly do the progress reporting for the ramfs merges)
2) during the merge of the on-disk files in the reduce phase
The thread for the first case is still there in the current patch. If we really want to fix
the issue, we should ideally remove the thread for the shuffle as well, since the ramfs merge
might also get stuck (user code is involved there).

Similarly to BufferSorter, we could have an API for merge that takes a Reporter object and
calls reporter.progress() periodically. ReduceTask as well as the final merge on the MapTask
could use that for the merges. Again, the argument here is that we do expect the merge to report
progress to us, and hence we enable it to do so.
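
A merge API along those lines might look roughly like the sketch below. Again, this is only
an illustration under my own assumptions: MergeProgress, ReportingMerger and the
recordsPerReport parameter are hypothetical names, and a simple k-way merge over int arrays
stands in for the real merge of sorted segments.

```java
// Hypothetical stand-in for Reporter; only progress() is needed.
interface MergeProgress {
    void progress();
}

final class ReportingMerger {
    private ReportingMerger() {}

    // Merges already-sorted runs into one sorted array and calls
    // reporter.progress() once per recordsPerReport records emitted.
    static int[] merge(int[][] runs, MergeProgress reporter, int recordsPerReport) {
        if (recordsPerReport <= 0) {
            throw new IllegalArgumentException("recordsPerReport must be positive");
        }
        int total = 0;
        for (int[] run : runs) {
            total += run.length;
        }
        int[] out = new int[total];
        int[] pos = new int[runs.length]; // next unread index in each run

        for (int n = 0; n < total; n++) {
            // Pick the run whose next record is smallest.
            int best = -1;
            for (int r = 0; r < runs.length; r++) {
                if (pos[r] < runs[r].length
                        && (best < 0 || runs[r][pos[r]] < runs[best][pos[best]])) {
                    best = r;
                }
            }
            out[n] = runs[best][pos[best]++];

            // Progress is tied to records actually merged, not to wall-clock time.
            if (reporter != null && (n + 1) % recordsPerReport == 0) {
                reporter.progress();
            }
        }
        return out;
    }
}
```

Both the ReduceTask merges and the final merge on the MapTask could pass their reporter to
such a call, so no separate progress thread would be needed for them.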

> Map tasks can't timeout for failing to call progress
> ----------------------------------------------------
>                 Key: HADOOP-1431
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1431
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0
>            Reporter: Owen O'Malley
>         Assigned To: Arun C Murthy
>             Fix For: 0.13.0
>         Attachments: HADOOP-1431_1_20070525.patch
> Currently the map task runner creates a thread that calls progress every second to keep
the system from killing the map if the sort takes too long. This is the wrong approach, because
it will cause stuck tasks to not be killed. The right solution is to have the sort call progress
as it actually makes progress. This is part of what is going on in HADOOP-1374. A map gets
stuck at 100% progress, but not done.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
