aurora-issues mailing list archives

From "Mehrdad Nurolahzade (JIRA)" <>
Subject [jira] [Commented] (AURORA-1869) Investigate the status update processing overhead
Date Fri, 05 May 2017 19:48:04 GMT


Mehrdad Nurolahzade commented on AURORA-1869:

{{TaskStatusHandlerImpl}} acquires the {{LogStorage}} write lock to process every status update
received from the Mesos master. During implicit and explicit reconciliations, the lock is
acquired once per task in the cluster (tens of thousands of times in our cluster).

According to data extracted from one of our production clusters, over 99.9% of reconciliation
status update events are in fact {{NOOP}} status updates (as described above). The storage
write lock contention induced by these status updates can be eliminated by adopting the
double-checked locking pattern (as was done in [AURORA-1820]).
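To illustrate the idea, here is a minimal sketch of double-checked locking applied to status update handling. The class and method names are hypothetical, not Aurora's actual API; the point is that the cheap first check lets {{NOOP}} updates return without ever touching the write lock, and the second check inside the lock guards against a concurrent transition:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only: names do not correspond to Aurora classes.
public class StatusUpdateHandler {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private volatile String currentState = "RUNNING";

    /** Returns true if the update caused a state transition. */
    public boolean handleUpdate(String reportedState) {
        // First check, without the write lock: during reconciliation the
        // reported state almost always matches, so we return here (NOOP)
        // and never contend on the lock.
        if (reportedState.equals(currentState)) {
            return false;
        }
        lock.writeLock().lock();
        try {
            // Second check, under the lock: the state may have changed
            // while we were waiting to acquire it.
            if (reportedState.equals(currentState)) {
                return false;
            }
            currentState = reportedState;
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public String getState() {
        return currentState;
    }
}
```

Note that {{currentState}} is {{volatile}} so the unlocked first read observes the latest committed value; without it the first check could act on a stale state.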

This explains why the combination of reconciliation status update processing and other expensive
operations like snapshots can be fatal for the scheduler. Because the lock is not fair, it does
not guarantee any particular access order; snapshot structures might therefore sit on the heap
for several seconds before they can be written to {{LogStorage}} and garbage collected.

> Investigate the status update processing overhead
> -------------------------------------------------
>                 Key: AURORA-1869
>                 URL:
>             Project: Aurora
>          Issue Type: Task
>          Components: Scheduler
>            Reporter: Mehrdad Nurolahzade
>            Priority: Minor
> There is a peculiar similarity between the number of task status update events
received from Mesos and the number of JVM threads started by the system ([graphview|]).
It appears that a new thread is started every time a status update event is processed.
> {{TaskStatusHandlerImpl}} is a single-threaded service and therefore should not instantiate
new threads. Looking at status update reasons/results, the majority of status updates are
associated with {{RECONCILIATION}} and should result in {{NOOP}}. Therefore, they should have
no impact on the internal workers: the task state machine should short-circuit and return
upon realizing that the task's reported new state matches the scheduler's view.
> {code:title=TaskStateMachine.updateState()}
> if (stateMachine.getState() == taskState) {
>   return new TransitionResult(NOOP, ImmutableSet.of());
> }
> {code}
> Given the volume of status update events received during reconciliation, this overhead needs
to be avoided if possible.

This message was sent by Atlassian JIRA
