aurora-reviews mailing list archives

From Mehrdad Nurolahzade <mehr...@apache.org>
Subject Review Request 59030: AURORA-1869 Reducing storage write lock contention in TaskStatusHandlerImpl
Date Fri, 05 May 2017 21:36:12 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59030/
-----------------------------------------------------------

Review request for Aurora, David McLaughlin, Stephan Erb, and Zameer Manji.


Bugs: AURORA-1869
    https://issues.apache.org/jira/browse/AURORA-1869


Repository: aurora


Description
-------

`TaskStatusHandlerImpl` acquires the `LogStorage` write lock to process every status update
received from the Mesos master. During implicit and explicit reconciliation, that means one
acquisition per task in the cluster (tens of thousands in our cluster).

According to data extracted from one of our production clusters, over 99.9% of reconciliation
status update events are in fact `NOOP` status updates. The storage write lock contention
induced by these status updates can be eliminated by adopting the double-checked locking
pattern (as was done in [AURORA-1820](https://issues.apache.org/jira/browse/AURORA-1820)).
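
As a rough illustration of the double-checked locking idea (not the code in this diff), the sketch
below checks under a shared read lock whether an update would be a no-op and takes the write lock
only when it would not, then re-checks under the write lock. The class `DoubleCheckedStatusHandler`,
the `taskStates` map, and the string task states are hypothetical stand-ins; the real handler goes
through Aurora's storage and state manager APIs, which are not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a task store guarded by a read/write lock. */
class DoubleCheckedStatusHandler {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, String> taskStates = new HashMap<>();
  // Counts write lock acquisitions so the effect of the pattern is observable.
  private final AtomicInteger writeLockAcquisitions = new AtomicInteger();

  /** Applies a status update; returns true only if the stored state changed. */
  boolean handleStatusUpdate(String taskId, String newState) {
    // First check: under the shared read lock, detect a no-op update.
    lock.readLock().lock();
    try {
      if (newState.equals(taskStates.get(taskId))) {
        return false; // No-op: never touches the write lock.
      }
    } finally {
      lock.readLock().unlock();
    }

    // The update looks like a real change, so take the exclusive write lock.
    lock.writeLock().lock();
    writeLockAcquisitions.incrementAndGet();
    try {
      // Second check: another thread may have applied the same change
      // between releasing the read lock and acquiring the write lock.
      if (newState.equals(taskStates.get(taskId))) {
        return false;
      }
      taskStates.put(taskId, newState);
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }

  int writeLockAcquisitions() {
    return writeLockAcquisitions.get();
  }
}
```

With this shape, a reconciliation pass in which nearly all updates are no-ops contends on the
write lock only for the few updates that actually change state.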

This contention explains why combining reconciliation status update processing with other expensive
processes, such as snapshotting, can be fatal for the scheduler. Because the lock is not fair, it
does not guarantee any particular access order, so snapshot structures might need to sit on the heap
for a few seconds before they can be written to `LogStorage` and garbage collected.
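
For readers unfamiliar with lock fairness, the sketch below shows the distinction using
`java.util.concurrent.locks.ReentrantReadWriteLock`, which is non-fair by default and accepts a
fairness flag in its constructor. It assumes the storage lock behaves like this JDK class; the
class name `LockFairnessDemo` is hypothetical, and this change does not alter the lock's
fairness policy.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical demo of fair versus non-fair read/write locks in the JDK. */
class LockFairnessDemo {
  /** Default constructor: non-fair, so waiting threads have no guaranteed order. */
  static ReentrantReadWriteLock nonFairLock() {
    return new ReentrantReadWriteLock();
  }

  /** Fair mode: contended acquisitions are granted approximately in arrival order. */
  static ReentrantReadWriteLock fairLock() {
    return new ReentrantReadWriteLock(true);
  }
}
```

Fair mode typically trades throughput for ordering, which is one reason to reduce write lock
acquisitions rather than switch policies.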


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/TaskStatusHandlerImpl.java 1aacecf3c2597a3f91dbc7da4c99fd1e80970f04

  src/test/java/org/apache/aurora/scheduler/TaskStatusHandlerImplTest.java 56a6b0c9ae8da18e9a47428b8ed37a559cfd04e7

  src/test/java/org/apache/aurora/scheduler/storage/testing/StorageTestUtil.java 21d26b3930ea965487b2dec48a48a98677ba022b



Diff: https://reviews.apache.org/r/59030/diff/1/


Testing
-------

TBD; to be verified on a test cluster.


Thanks,

Mehrdad Nurolahzade

