hadoop-mapreduce-issues mailing list archives

From "Allen Wittenauer (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4085) Kill task attempts longer than a configured queue max time
Date Fri, 30 Mar 2012 02:39:26 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242014#comment-13242014 ]

Allen Wittenauer commented on MAPREDUCE-4085:

I've talked with a few folks about this idea and even have a somewhat hacky patch that "works",
but it is not yet in a shareable state.

The way I've written it, the TaskTracker has code, similar to the progress checker, that
compares a task attempt's run time against mapred.queue.(queuename).task-time-limit.  If the
attempt is over this limit, the TaskTracker kills it, logs an error, and moves on.
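As a rough sketch of the check described above (not the actual patch; the class and method names here are invented for illustration), the elapsed-time comparison might look like:

```java
// Illustrative sketch of a per-attempt time-limit check, in the spirit of
// the TaskTracker's progress checker. Names are hypothetical.
public class TaskTimeLimitChecker {
    private final long taskTimeLimitMs; // from mapred.queue.(queuename).task-time-limit

    public TaskTimeLimitChecker(long taskTimeLimitMs) {
        this.taskTimeLimitMs = taskTimeLimitMs;
    }

    /**
     * Returns true if the attempt has run longer than the queue's limit
     * and should be killed. A non-positive limit means "no limit".
     */
    public boolean shouldKill(long attemptStartMs, long nowMs) {
        if (taskTimeLimitMs <= 0) {
            return false;
        }
        return (nowMs - attemptStartMs) > taskTimeLimitMs;
    }
}
```

The kill/log/continue behavior would then hang off a periodic call to `shouldKill` for each running attempt.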

The hard part is getting the queue time limit to the TaskTracker in a scalable, secure way.
The TaskTracker appears to read the uploaded jobconf directly, without it going through any
modification, while the JobTracker appears to do most of the job vetting prior to scheduling
the tasks.  Since this is a queue variable, the JT should ideally be the one that 'owns' it.
That also allows for easy mradmin refresh functionality.

Some of the ideas that have been bounced around:

* JT rewrites the JobConf file prior to scheduling
* TT opens a connection to a jsp on the JT, fetches the info, stores it into a local TT cache
* TT uses a InterTrackerProtocol to ask the JT, stores it into a local TT cache w/TTL
* JT passes the info along with the heartbeat response
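The "TT asks the JT and caches with a TTL" option could be sketched roughly like this (a hypothetical illustration only; the cache class, the fetcher abstraction, and the use of modern Java are all assumptions, not the actual InterTrackerProtocol code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical TT-side cache of per-queue time limits, refreshed from the
// JT (e.g. via an RPC or jsp fetch, abstracted here as a Function).
public class QueueLimitCache {
    private static final class Entry {
        final long limitMs;
        final long fetchedAtMs;
        Entry(long limitMs, long fetchedAtMs) {
            this.limitMs = limitMs;
            this.fetchedAtMs = fetchedAtMs;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMs;
    private final Function<String, Long> fetcher; // asks the JT for a queue's limit

    public QueueLimitCache(long ttlMs, Function<String, Long> fetcher) {
        this.ttlMs = ttlMs;
        this.fetcher = fetcher;
    }

    /** Returns the cached limit for a queue, refetching once the TTL expires. */
    public long limitFor(String queue, long nowMs) {
        Entry e = cache.get(queue);
        if (e == null || nowMs - e.fetchedAtMs > ttlMs) {
            e = new Entry(fetcher.apply(queue), nowMs);
            cache.put(queue, e);
        }
        return e.limitMs;
    }
}
```

The TTL bounds how stale a TT's view of the limit can be after an mradmin refresh, while keeping JT traffic to one fetch per queue per TTL window.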

I'd like to have some discussion on some of these ideas to see what folks think is the most
viable approach.
> Kill task attempts longer than a configured queue max time
> ----------------------------------------------------------
>                 Key: MAPREDUCE-4085
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4085
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: task
>            Reporter: Allen Wittenauer
> For some environments, it is desirable for certain queues to have an SLA with regard
> to task turnover (i.e., a slot will be free in X minutes and scheduled to the appropriate
> job).  Queues should have a 'task time limit' that causes task attempts over this time
> to be killed.  This leaves open the possibility that, if the task was on a bad node, it
> could still be rescheduled up to max.task.attempt times.
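Following the property pattern named in the comment, such a limit might be declared in queue configuration along these lines (the value, units, and exact file placement are illustrative assumptions, not a committed interface):

```xml
<!-- Illustrative: a 60-minute task time limit for a hypothetical
     'production' queue. Property name follows the pattern from the
     comment; milliseconds are an assumed unit. -->
<property>
  <name>mapred.queue.production.task-time-limit</name>
  <value>3600000</value>
  <description>Kill task attempts in this queue that run longer
  than this many milliseconds. Non-positive means no limit.</description>
</property>
```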

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

