aurora-issues mailing list archives

From "Bill Farner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-1953) Scheduler livelock during startup
Date Fri, 27 Oct 2017 04:54:00 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221710#comment-16221710 ]

Bill Farner commented on AURORA-1953:
-------------------------------------

https://reviews.apache.org/r/63316/ is the current candidate fix for this issue.

> Scheduler livelock during startup
> ---------------------------------
>
>                 Key: AURORA-1953
>                 URL: https://issues.apache.org/jira/browse/AURORA-1953
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 0.18.0
>            Reporter: Bill Farner
>            Priority: Blocker
>
> The scheduler may experience a "livelock" situation while starting up: async event handlers
> running on a {{ThreadPoolExecutor}} block until other, not-yet-executed events are processed.
> If enough of these blocking handlers run simultaneously, no further event processing occurs
> and the scheduler stalls.
> More specifically, this section of {{TaskGroups}} is afflicted:
> {code}
> CompletableFuture<Set<String>> result = batchWorker.execute(storeProvider ->
>     taskScheduler.schedule(storeProvider, taskIds));
> Set<String> scheduled = null;
> try {
>   scheduled = result.get();
> {code}
> {{batchWorker#execute}} submits to a queue that is not processed until a {{SchedulerActive}}
> event is fired within the scheduler.  {{SchedulerActive}} is sent via an {{AsyncEventBus}},
> which also happens to trigger the above code from {{TaskGroups}}.  Therefore, the following
> sequence of events will cause a livelock:
> {noformat}
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> DriverRegistered
> {noformat}
> Any other events may occur between the above events, but the important sequence is N
> {{TaskStateChange=pending}} events, where N={{-async_worker_threads}}, followed by
> {{DriverRegistered}}.
> This issue was exacerbated by [f2755e1|https://github.com/apache/aurora/commit/f2755e1cdd67f3c1516726c21d6e8f13059a5a01],
> which has the subtle effect of not using {{GatingDelayExecutor#closeDuring()}}, which would
> enqueue all these events until storage recovery is complete.  The on-demand execution greatly
> increases the likelihood of the above event sequence, since driver registration begins strictly
> after storage recovery completes.
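
The stall described above can be reproduced in isolation with plain java.util.concurrent classes. The following is a minimal sketch, not Aurora code: the class name StartupLivelockSketch, the method stalls, and the 500 ms observation window are all hypothetical, and a single CompletableFuture stands in for the batch worker result that only completes once the SchedulerActive event has been handled. It assumes the behaviour described in the issue: each pending-task handler blocks a pool thread, and the event that would unblock them is queued on the same pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class StartupLivelockSketch {

  /**
   * Returns true if the pending-event handlers are still blocked after a
   * short observation window. All names here are illustrative assumptions.
   */
  static boolean stalls(int workerThreads, int pendingEvents) throws InterruptedException {
    // Stand-in for the async event executor sized by -async_worker_threads.
    ExecutorService pool = Executors.newFixedThreadPool(workerThreads);
    // Stand-in for "the batch worker queue is being processed".
    CompletableFuture<Void> schedulerActive = new CompletableFuture<>();
    CountDownLatch handled = new CountDownLatch(pendingEvents);
    try {
      // N x TaskStateChange=pending: each handler blocks a worker thread,
      // mirroring result.get() in the TaskGroups snippet.
      for (int i = 0; i < pendingEvents; i++) {
        pool.execute(() -> {
          try {
            schedulerActive.get();
            handled.countDown();
          } catch (InterruptedException | ExecutionException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // DriverRegistered: the event that would unblock the handlers is
      // queued behind them on the same pool.
      pool.execute(() -> schedulerActive.complete(null));
      // If the handlers do not finish within the window, treat it as a stall.
      return !handled.await(500, TimeUnit.MILLISECONDS);
    } finally {
      pool.shutdownNow(); // interrupts any handlers that are still blocked
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("2 threads, 2 pending events, stalls: " + stalls(2, 2));
    System.out.println("2 threads, 1 pending event, stalls: " + stalls(2, 1));
  }
}
```

With as many blocking handlers as worker threads, the unblocking task never gets a thread and the sketch reports a stall. With one fewer handler, a free thread runs the unblocking task and everything completes.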



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
