aurora-issues mailing list archives

From "Jordan Ly (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AURORA-1945) Rescinds received but not processed in time before offer accept
Date Mon, 21 Aug 2017 22:51:00 GMT
Jordan Ly created AURORA-1945:
---------------------------------

             Summary: Rescinds received but not processed in time before offer accept
                 Key: AURORA-1945
                 URL: https://issues.apache.org/jira/browse/AURORA-1945
             Project: Aurora
          Issue Type: Bug
          Components: Scheduler
            Reporter: Jordan Ly
            Assignee: Jordan Ly
            Priority: Minor


The following race condition for offers is currently possible:

# Scheduler receives an offer and adds it to the executor queue for processing.
# The executor processes the offer and adds it to the HostOffers list.
# Scheduler receives a rescind for that offer and adds it to the executor queue for processing. If the executor is under heavy load, there can be a delay between receiving the rescind and processing it.
# Scheduler accepts the offer before the executor has processed the rescind. The task is then launched against an invalid offer, which results in TASK_LOST.
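
As a rough illustration of the steps above, the following Python sketch models the executor as a single FIFO queue. It is a toy model, not Aurora code: {{SchedulerModel}}, its methods, and the backlog size of 3 are hypothetical names and values chosen for the example.

```python
from collections import deque


class SchedulerModel:
    """Toy model of the offer/rescind flow; all names are hypothetical."""

    def __init__(self):
        self.executor_queue = deque()   # FIFO work drained by one executor
        self.host_offers = set()        # offers the scheduler thinks are usable
        self.master_rescinded = set()   # ground truth on the Mesos master side

    def on_offer(self, offer_id):
        # Offer callback: only enqueue, the executor does the real work later.
        self.executor_queue.append(("offer", offer_id))

    def on_rescind(self, offer_id):
        # The master has already invalidated the offer when this callback fires.
        self.master_rescinded.add(offer_id)
        self.executor_queue.append(("rescind", offer_id))

    def add_load(self, n):
        # Unrelated work items that sit in front of later callbacks.
        for _ in range(n):
            self.executor_queue.append(("other", None))

    def process_next(self):
        # Run the oldest queued item, as the executor would.
        kind, offer_id = self.executor_queue.popleft()
        if kind == "offer":
            self.host_offers.add(offer_id)
        elif kind == "rescind":
            self.host_offers.discard(offer_id)

    def accept(self, offer_id):
        # Launch against an offer; the master rejects rescinded offers.
        if offer_id not in self.host_offers:
            return "NO_OFFER"
        self.host_offers.remove(offer_id)
        return "TASK_LOST" if offer_id in self.master_rescinded else "LAUNCHED"


def simulate_race():
    s = SchedulerModel()
    s.on_offer("OFFER_X")        # step 1: offer is queued
    s.process_next()             # step 2: offer lands in the offer list
    s.add_load(3)                # executor backlog
    s.on_rescind("OFFER_X")      # step 3: rescind is queued behind the backlog
    return s.accept("OFFER_X")   # step 4: accept runs before the rescind


print(simulate_race())
```

In this model the accept succeeds locally because the rescind is still waiting in the queue, while the master has already dropped the offer.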

The following logs show this in action:

Mesos:
{noformat}
I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is
no longer valid
W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]':
Offer OFFER_X is no longer valid
I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y
with invalid offers: Offer OFFER_X is no longer valid'
{noformat}

Aurora:
{noformat}
I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received
offer: OFFER_X 
I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer
OFFER_X with ops [LAUNCH] 
I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received
status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS:
Task launched with invalid offers: Offer_X is no longer valid 
I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer
rescinded: OFFER_X
W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer:
OFFER_X. 
{noformat}

We should find a way to prioritize rescinds, or process them immediately, to avoid this delay. Any fix should
also take into account the earlier race condition fixed by [AURORA-1933|https://issues.apache.org/jira/browse/AURORA-1933]
so that we do not reintroduce it.
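
One possible shape for "process rescinds immediately" is sketched below in Python. This is only an illustration under stated assumptions, not the actual or proposed Aurora change: {{OfferBook}} and its methods are hypothetical, and the idea is simply that the rescind callback updates a lock-protected structure directly instead of going through the executor queue. Whether such an approach is compatible with the AURORA-1933 fix would have to be checked against the real code.

```python
import threading


class OfferBook:
    """Hypothetical offer store where rescinds bypass the work queue."""

    def __init__(self):
        self._lock = threading.Lock()
        self._offers = set()      # offers available for scheduling
        self._rescinded = set()   # rescinds seen before their offer was added

    def rescind_now(self, offer_id):
        # Called directly from the rescind callback thread.
        with self._lock:
            if offer_id in self._offers:
                self._offers.remove(offer_id)
            else:
                # Offer not stored (yet): remember the rescind so a late
                # add_offer() for the same ID is dropped.
                self._rescinded.add(offer_id)

    def add_offer(self, offer_id):
        # Called later by the executor; skips offers already rescinded.
        with self._lock:
            if offer_id in self._rescinded:
                self._rescinded.discard(offer_id)
                return False
            self._offers.add(offer_id)
            return True

    def try_accept(self, offer_id):
        # Returns True only if the offer is still believed to be valid.
        with self._lock:
            if offer_id in self._offers:
                self._offers.remove(offer_id)
                return True
            return False


book = OfferBook()
book.add_offer("OFFER_X")
book.rescind_now("OFFER_X")           # applied immediately, no queue delay
print(book.try_accept("OFFER_X"))     # the accept is refused locally
```

A sketch like this only narrows the window. A rescind that is still in flight from the master when the accept is sent can still produce TASK_LOST, and the set of remembered rescinds would need some form of expiry, since a rescind for an offer that was already accepted is never cleared.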



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
