aurora-issues mailing list archives

From "Santhosh Kumar Shanmugham (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AURORA-1966) TASK_UNKNOWN to PARTITIONED mapping causes Scheduler to kill a non-existent Task indefinitely
Date Wed, 24 Jan 2018 23:52:00 GMT
Santhosh Kumar Shanmugham created AURORA-1966:
-------------------------------------------------

             Summary: TASK_UNKNOWN to PARTITIONED mapping causes Scheduler to kill a non-existent Task indefinitely
                 Key: AURORA-1966
                 URL: https://issues.apache.org/jira/browse/AURORA-1966
             Project: Aurora
          Issue Type: Bug
            Reporter: Santhosh Kumar Shanmugham


When a Task launch fails, the task is moved from ASSIGNED to LOST, which triggers a RESCHEDULE and a KILL. Unfortunately, KILLing a non-existent task against the Mesos master results in a TASK_UNKNOWN status update, which gets mapped to PARTITIONED. Although the transition from LOST to PARTITIONED is not allowed, some callbacks are executed regardless, resulting in another KILL and RESCHEDULE action. This new KILL triggers another TASK_UNKNOWN, and hence another PARTITIONED status update for the same task, so the Scheduler keeps attempting to KILL the non-existent task indefinitely. Attempting a client {{job killall}} results in the same state for the scheduler.
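
To make the feedback loop easier to follow, here is a minimal, self-contained Java sketch of the cycle. It is illustrative only: the class, the names, and the bounded loop are assumptions for this example, not Aurora's actual {{TaskStateMachine}}/{{StateManager}} code.
{code:java}
import java.util.EnumSet;
import java.util.Set;

// Illustrative sketch only -- not Aurora's actual state machine code.
public class KillLoopSketch {

  enum State { ASSIGNED, LOST, PARTITIONED }

  // LOST is used as a black hole here, so LOST -> PARTITIONED is not an allowed transition.
  static final Set<State> BLACK_HOLED = EnumSet.of(State.LOST);

  // Mirrors the TASK_UNKNOWN -> PARTITIONED mapping described above.
  static State fromMesosState(String mesosState) {
    return "TASK_UNKNOWN".equals(mesosState) ? State.PARTITIONED : State.LOST;
  }

  public static void main(String[] args) {
    State current = State.LOST;            // task was black-holed after the failed launch
    String pendingUpdate = "TASK_UNKNOWN"; // master's reply to the first KILL
    int kills = 0;

    // Bounded for the demo; in the scheduler this loop never terminates.
    while (pendingUpdate != null && kills < 5) {
      State target = fromMesosState(pendingUpdate);
      boolean allowed = !BLACK_HOLED.contains(current);
      System.out.printf("Callback transition %s to %s, allow: %s%n", current, target, allowed);

      // The bug: even though the transition is rejected, the KILL side effect still runs,
      // which provokes yet another TASK_UNKNOWN update from the master.
      kills++;
      System.out.println("Killing non-existent task (KILL #" + kills + ")");
      pendingUpdate = "TASK_UNKNOWN";
    }
    System.out.println("Demo stopped after " + kills + " kills; the real scheduler does not stop.");
  }
}
{code}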

Since the scheduler uses the LOST state for black-holing tasks, the {{TaskStateMachine}} needs to take these status updates into account.
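
One possible direction, shown as a hypothetical sketch (again not Aurora's actual {{TaskStateMachine}} API): swallow status updates for tasks that are already black-holed instead of re-running the KILL/RESCHEDULE callbacks.
{code:java}
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch -- a guard that treats status updates for tasks the scheduler has
// already black-holed as no-ops instead of re-running the KILL/RESCHEDULE callbacks.
public class BlackHoleGuardSketch {

  enum State { ASSIGNED, LOST, PARTITIONED }

  // States used to black-hole tasks in this sketch.
  static final Set<State> BLACK_HOLED = EnumSet.of(State.LOST);

  // Returns true if the update was applied, false if it was swallowed.
  static boolean handleStatusUpdate(State current, State proposed, Runnable killCallback) {
    if (BLACK_HOLED.contains(current)) {
      // The task is already gone as far as the scheduler is concerned; do not KILL again.
      System.out.printf("Ignoring %s update for task already in %s%n", proposed, current);
      return false;
    }
    killCallback.run();
    return true;
  }

  public static void main(String[] args) {
    // TASK_UNKNOWN mapped to PARTITIONED for a task already in LOST: swallowed, no KILL loop.
    handleStatusUpdate(State.LOST, State.PARTITIONED,
        () -> System.out.println("Killing task (should not be reached)"));
  }
}
{code}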

I was able to reproduce this in the Vagrant image by faking a launch failure.
{code:java}
I0124 05:48:23.198 [qtp1791010542-40, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition INIT -> PENDING
I0124 05:48:23.213508 9748 log.cpp:560] Attempting to append 1679 bytes to the log
I0124 05:48:23.214570 9748 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 24778
I0124 05:48:23.214834 9748 replica.cpp:540] Replica received write request for position 24778 from __req_res__(4)@192.168.33.7:8083
I0124 05:48:23.221982 9748 leveldb.cpp:341] Persisting action (1700 bytes) to leveldb took 6.772102ms
I0124 05:48:23.222174 9748 replica.cpp:711] Persisted action APPEND at position 24778
I0124 05:48:23.222901 9748 replica.cpp:694] Replica received learned notice for position 24778 from log-network(1)@192.168.33.7:8083
I0124 05:48:23.226833 9748 leveldb.cpp:341] Persisting action (1702 bytes) to leveldb took 3.227779ms
I0124 05:48:23.227008 9748 replica.cpp:711] Persisted action APPEND at position 24778
I0124 05:48:23.262 [qtp1791010542-40, RequestLog] 127.0.0.1 - - [24/Jan/2018:05:48:23 +0000] "POST //aurora.local/api HTTP/1.1" 200 78
I0124 05:48:23.267 [qtp1791010542-40, LoggingInterceptor] getTasksWithoutConfigs(TaskQuery(role:null, environment:null, jobName:null, taskIds:null, statuses:null, instanceIds:null, slaveHosts:null, jobKeys:[JobKey(role:vagrant, environment:test, name:fail-partition_aware_disabled)], offset:0, limit:0))
I0124 05:48:23.285 [qtp1791010542-40, RequestLog] 127.0.0.1 - - [24/Jan/2018:05:48:23 +0000] "POST //aurora.local/api HTTP/1.1" 200 794
I0124 05:48:23.349 [TaskGroupBatchWorker, StateMachine] Callback transition PENDING to ASSIGNED, allow: true
I0124 05:48:23.353 [TaskGroupBatchWorker, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition PENDING -> ASSIGNED
I0124 05:48:23.356 [TaskGroupBatchWorker, TaskAssignerImpl] Offer on agent 192.168.33.7 (id fe8bc641-aa02-4363-a990-318d20de1bac-S0) is being assigned task for vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606.
W0124 05:48:23.445 [TaskGroupBatchWorker, TaskAssignerImpl] Failed to launch task. org.apache.aurora.scheduler.offers.OfferManager$LaunchException: Failed to launch task.
    at org.apache.aurora.scheduler.offers.OfferManagerImpl.launchTask(OfferManagerImpl.java:212)
    at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
    at org.apache.aurora.scheduler.scheduling.TaskAssignerImpl.launchUsingOffer(TaskAssignerImpl.java:126)
    at org.apache.aurora.scheduler.scheduling.TaskAssignerImpl.maybeAssign(TaskAssignerImpl.java:262)
    at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
    at org.apache.aurora.scheduler.scheduling.TaskSchedulerImpl.scheduleTasks(TaskSchedulerImpl.java:154)
    at org.apache.aurora.scheduler.scheduling.TaskSchedulerImpl.schedule(TaskSchedulerImpl.java:108)
    at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
    at org.apache.aurora.scheduler.scheduling.TaskGroups$1.lambda$run$0(TaskGroups.java:174)
    at org.apache.aurora.scheduler.BatchWorker$Work.apply(BatchWorker.java:117)
    at org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210)
    at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
    at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
    at org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199)
    at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:89)
    at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
    at org.apache.aurora.scheduler.storage.durability.DurableStorage.doInTransaction(DurableStorage.java:198)
    at org.apache.aurora.scheduler.storage.durability.DurableStorage.write(DurableStorage.java:221)
    at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:132)
    at org.apache.aurora.scheduler.BatchWorker.processBatch(BatchWorker.java:207)
    at org.apache.aurora.scheduler.BatchWorker.run(BatchWorker.java:199)
    at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
    at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: mocking launch failure
    at org.apache.aurora.scheduler.offers.OfferManagerImpl.launchTask(OfferManagerImpl.java:201)
    ... 23 common frames omitted
I0124 05:48:23.446 [TaskGroupBatchWorker, StateMachine] Callback transition ASSIGNED to LOST, allow: true
I0124 05:48:23.447 [TaskGroupBatchWorker, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition ASSIGNED -> LOST
I0124 05:48:23.449 [TaskGroupBatchWorker, StateManagerImpl] Task being rescheduled: vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
I0124 05:48:23.454 [TaskGroupBatchWorker, StateMachine] Callback transition INIT to PENDING, allow: true
I0124 05:48:23.456 [TaskGroupBatchWorker, StateMachine] vagrant-test-fail-partition_aware_disabled-0-a882eee4-b214-49e1-be9b-1b60d119ed60 state machine transition INIT -> PENDING
I0124 05:48:23.473 [TaskGroupBatchWorker, StateManagerImpl] Task being killed: vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
I0124 05:48:23.476 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Killing task vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
I0124 05:48:23.496659 9751 scheduler.cpp:243] Sending KILL call to http://192.168.33.7:5050/master/api/v1/scheduler
I0124 05:48:23.499577 9749 scheduler.cpp:676] Enqueuing event UPDATE received from http://192.168.33.7:5050/master/api/v1/scheduler
W0124 05:48:23.505 [AsyncProcessor-4, TaskVars] Failed to find rack attribute associated with host 192.168.33.7
W0124 05:48:23.515 [AsyncProcessor-4, Stats] Re-using already registered variable for key task_delivery_delay_SOURCE_MASTER_timeouts_per_sec
W0124 05:48:23.515 [AsyncProcessor-4, Stats] Re-using already registered variable for key task_delivery_delay_SOURCE_MASTER_requests_per_sec
I0124 05:48:23.524458 9754 log.cpp:560] Attempting to append 3625 bytes to the log
I0124 05:48:23.524636 9754 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 24779
I0124 05:48:23.525111 9754 replica.cpp:540] Replica received write request for position 24779 from __req_res__(5)@192.168.33.7:8083
I0124 05:48:23.526832 9754 leveldb.cpp:341] Persisting action (3646 bytes) to leveldb took 1.612374ms
I0124 05:48:23.526927 9754 replica.cpp:711] Persisted action APPEND at position 24779
I0124 05:48:23.527189 9754 replica.cpp:694] Replica received learned notice for position 24779 from log-network(1)@192.168.33.7:8083
I0124 05:48:23.528949 9754 leveldb.cpp:341] Persisting action (3648 bytes) to leveldb took 1.543755ms
I0124 05:48:23.529088 9754 replica.cpp:711] Persisted action APPEND at position 24779
I0124 05:48:23.538 [TaskStatusHandlerImpl, TaskStatusHandlerImpl] Task:vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 moved to status:PARTITIONED for mesos state: TASK_UNKNOWN
I0124 05:48:23.540 [TaskStatusHandlerImpl, StateMachine] Callback transition LOST to PARTITIONED, allow: false
I0124 05:48:23.540 [TaskStatusHandlerImpl, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition LOST -> PARTITIONED (not allowed)
I0124 05:48:23.540 [TaskStatusHandlerImpl, TaskStateMachine] LOST to PARTITIONED => KILL
E0124 05:48:23.540 [TaskStatusHandlerImpl, TaskStateMachine] Illegal state transition attempted: LOST -> PARTITIONED (not allowed)
I0124 05:48:23.541 [TaskStatusHandlerImpl, StateManagerImpl] Task being killed: vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
I0124 05:48:23.542 [TaskStatusHandlerImpl, VersionedSchedulerDriverService] Killing task vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
I0124 05:48:23.543357 9749 scheduler.cpp:243] Sending KILL call to http://192.168.33.7:5050/master/api/v1/scheduler
W0124 05:48:23.546 [TaskGroupBatchWorker, TaskSchedulerImpl] Failed to look up tasks vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
I0124 05:48:23.587267 9751 scheduler.cpp:676] Enqueuing event UPDATE received from http://192.168.33.7:5050/master/api/v1/scheduler
I0124 05:48:23.593 [TaskStatusHandlerImpl, TaskStatusHandlerImpl] Task:vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 moved to status:PARTITIONED for mesos state: TASK_UNKNOWN
I0124 05:48:23.595 [TaskStatusHandlerImpl, StateMachine] Callback transition LOST to PARTITIONED, allow: false
I0124 05:48:23.595 [TaskStatusHandlerImpl, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition LOST -> PARTITIONED (not allowed)
I0124 05:48:23.595 [TaskStatusHandlerImpl, TaskStateMachine] LOST to PARTITIONED => KILL
E0124 05:48:23.595 [TaskStatusHandlerImpl, TaskStateMachine] Illegal state transition attempted: LOST -> PARTITIONED (not allowed)
I0124 05:48:23.595 [TaskStatusHandlerImpl, StateManagerImpl] Task being killed: vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
I0124 05:48:23.595 [TaskStatusHandlerImpl, VersionedSchedulerDriverService] Killing task vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
I0124 05:48:23.596256 9747 scheduler.cpp:243] Sending KILL call to http://192.168.33.7:5050/master/api/v1/scheduler
I0124 05:48:23.634858 9751 scheduler.cpp:676] Enqueuing event UPDATE received from http://192.168.33.7:5050/master/api/v1/scheduler
I0124 05:48:23.636 [TaskStatusHandlerImpl, TaskStatusHandlerImpl] Task:vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 moved to status:PARTITIONED for mesos state: TASK_UNKNOWN
{code}
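
For reference, the launch failure above was faked by making the launch path throw (see the {{Caused by: java.lang.IllegalStateException: mocking launch failure}} frame from {{OfferManagerImpl.launchTask}} in the log). The snippet below is a self-contained stand-in for that fault injection, not the actual change applied to {{OfferManagerImpl}}.
{code:java}
// Self-contained stand-in for the fault injection -- hypothetical, not the real patch.
public class FakeLaunchFailure {

  // Stand-in for the real launch call; always fails, like the injected hack.
  static void launchTask(String taskId) {
    throw new IllegalStateException("mocking launch failure");
  }

  public static void main(String[] args) {
    try {
      launchTask("vagrant-test-fail-partition_aware_disabled-0");
    } catch (IllegalStateException e) {
      // In the scheduler this is where ASSIGNED -> LOST (and the RESCHEDULE + KILL) kicks in.
      System.out.println("Launch failed as intended: " + e.getMessage());
    }
  }
}
{code}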



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
