[ https://issues.apache.org/jira/browse/AURORA-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643199#comment-14643199 ]
Maxim Khutornenko commented on AURORA-1404:
-------------------------------------------
The response time for stuck ASSIGNED tasks can be improved via AURORA-1370. I think it's generally
more robust to kill/reschedule an ASSIGNED task instead of retrying a {{launchTasks}} call
for something that's already in-flight.
> Reconcile ASSIGNED tasks that have not transitioned to STARTING
> ---------------------------------------------------------------
>
> Key: AURORA-1404
> URL: https://issues.apache.org/jira/browse/AURORA-1404
> Project: Aurora
> Issue Type: Task
> Components: Scheduler
> Reporter: Joshua Cohen
>
> If the Mesos master fails over after Aurora moves a task to {{ASSIGNED}} but before the
> slave receives the message, those tasks never transition and are eventually timed out
> by [TaskTimeout|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/async/TaskTimeout.java].
> Instead, it would be better to have a mechanism similar to [KillRetry|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/async/KillRetry.java]
> that checks whether assigned tasks have transitioned to a running state and, if not,
> transitions them to {{LOST}}.
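
For illustration only, the mechanism described above could be sketched roughly as follows. This is not Aurora code: the class {{AssignedTaskReconciler}}, its methods, the simplified {{Status}} enum, and the millisecond timeout are all hypothetical names chosen for the sketch. It assumes the scheduler reports every status change to the tracker and periodically calls {{reconcile}} with the current time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not part of Aurora: flags tasks stuck in ASSIGNED. */
final class AssignedTaskReconciler {
  /** Simplified stand-in for the scheduler's task states (assumption). */
  enum Status { ASSIGNED, STARTING, RUNNING, LOST }

  private final long timeoutMs;
  // Task ID -> time (ms) at which the task entered ASSIGNED.
  private final Map<String, Long> assignedAt = new HashMap<>();
  // Task ID -> last known status.
  private final Map<String, Status> statuses = new HashMap<>();

  AssignedTaskReconciler(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  /** Records a status change; only ASSIGNED tasks remain under watch. */
  void onStatusChange(String taskId, Status status, long nowMs) {
    statuses.put(taskId, status);
    if (status == Status.ASSIGNED) {
      assignedAt.put(taskId, nowMs);
    } else {
      assignedAt.remove(taskId);
    }
  }

  /**
   * Moves every task that has stayed in ASSIGNED for at least timeoutMs
   * to LOST and returns the affected task IDs.
   */
  List<String> reconcile(long nowMs) {
    List<String> lost = new ArrayList<>();
    Iterator<Map.Entry<String, Long>> it = assignedAt.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> entry = it.next();
      if (nowMs - entry.getValue() >= timeoutMs) {
        statuses.put(entry.getKey(), Status.LOST);
        lost.add(entry.getKey());
        it.remove();
      }
    }
    return lost;
  }

  Status statusOf(String taskId) {
    return statuses.get(taskId);
  }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic. A real implementation would presumably drive the state change through the scheduler's own state machine so that a task marked {{LOST}} is rescheduled, rather than keeping a separate status map.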
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)