kafka-dev mailing list archives

From "Stanislav Kozlovski (JIRA)" <j...@apache.org>
Subject [jira] [Created] (KAFKA-7790) Trogdor - Does not time out tasks in time
Date Mon, 07 Jan 2019 13:11:00 GMT
Stanislav Kozlovski created KAFKA-7790:
------------------------------------------

             Summary: Trogdor - Does not time out tasks in time
                 Key: KAFKA-7790
                 URL: https://issues.apache.org/jira/browse/KAFKA-7790
             Project: Kafka
          Issue Type: Improvement
            Reporter: Stanislav Kozlovski
            Assignee: Stanislav Kozlovski


All Trogdor task specifications have a defined `startMs` and `durationMs`. When tasks fail
and are restarted, it is intuitive to expect that a task would not be re-run once its
scheduled window (`startMs` plus `durationMs`) has passed. Currently, that is not the case.

An example best illustrates the issue:
{code:java}
startMs = 12PM; durationMs = 1hour;
# 12:02 - Coordinator schedules a task to run on agent-0
# 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail.
# 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it re-schedules tasks that are not running on agent-0
# 13:20 - agent-0 process dies.
# 13:22 - agent-0 comes back up. Coordinator re-schedules the task, although its window ended at 13:00{code}
This can result in an endless loop of task rescheduling. If more tasks are scheduled
on agent-0 (e.g. a task scheduled to start every hour), we can end up overwhelming
the agent with tasks that we would rather have dropped.
h2. Changes


We propose that the Trogdor Coordinator not re-schedule a task if the time of re-scheduling
is at or past the task's end time, i.e. its start time plus its duration. Such a task is
marked as failed instead. More specifically:
{code:java}
// startMs and durationMs come from the task specification
if (currentTimeMs < startMs + durationMs) {
  scheduleTask();
} else {
  failTask();
}{code}
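
To make the check above concrete, here is a minimal, self-contained sketch applied to the example timeline. The class `RescheduleCheck`, the `Decision` enum and the method `decide` are hypothetical names used only for illustration; they are not part of Trogdor's actual API, and times are simplified to milliseconds since midnight.

```java
import java.time.Duration;

/** Hypothetical helper for illustration; not part of Trogdor's actual API. */
final class RescheduleCheck {
    enum Decision { SCHEDULE, FAIL }

    /**
     * Decides whether a task may still be (re-)scheduled.
     * The task's window is treated as [startMs, startMs + durationMs).
     */
    static Decision decide(long currentTimeMs, long startMs, long durationMs) {
        long endMs = startMs + durationMs;
        return currentTimeMs < endMs ? Decision.SCHEDULE : Decision.FAIL;
    }

    public static void main(String[] args) {
        // Assumption: times are milliseconds since midnight, mirroring the example.
        long startMs = Duration.ofHours(12).toMillis();    // 12:00
        long durationMs = Duration.ofHours(1).toMillis();  // 1 hour

        long at1247 = Duration.ofHours(12).plusMinutes(47).toMillis();
        long at1322 = Duration.ofHours(13).plusMinutes(22).toMillis();

        // 12:47 is inside the window, so the task is re-scheduled.
        System.out.println("12:47 -> " + decide(at1247, startMs, durationMs));
        // 13:22 is past the window's end at 13:00, so the task is failed.
        System.out.println("13:22 -> " + decide(at1322, startMs, durationMs));
    }
}
```

With this check, the re-scheduling at 12:47 would still go ahead, while the one at 13:22 would fail the task instead of starting it again.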

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)