mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Mahler <bmah...@apache.org>
Subject Re: Review Request 63208: Fixed a crash in ProcessManager::resume due to race.
Date Sun, 22 Oct 2017 18:22:01 GMT


> On Oct. 22, 2017, 9:09 a.m., Benjamin Hindman wrote:
> > 3rdparty/libprocess/src/process.cpp
> > Lines 3291 (patched)
> > <https://reviews.apache.org/r/63208/diff/1/?file=1865231#file1865231line3291>
> >
> >     Any way we could not have to grab this reference each time we loop? This is
in the hot path.

Yeah, I'll do that, it's also not a complete fix I realized because there's an access to `process->manage`
below the loop!


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63208/#review188906
-----------------------------------------------------------


On Oct. 22, 2017, 3 a.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63208/
> -----------------------------------------------------------
> 
> (Updated Oct. 22, 2017, 3 a.m.)
> 
> 
> Review request for mesos and Benjamin Hindman.
> 
> 
> Bugs: MESOS-7921
>     https://issues.apache.org/jira/browse/MESOS-7921
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When ProcessManager::resume moves the process into a BLOCKED
> state, it's possible that a TerminateEvent is enqueued and the
> process is placed back in the run queue. Another worker thread
> can pick it off the run queue, process the TerminateEvent, and
> delete the process! At this point, when the original thread
> in ProcessManager::resume tries to check if any events were
> enqueued before transitioning to BLOCKED, it will access a
> deleted process and crash.
> 
> Some example crash paths involving process::Latch below.
> 
> Path 1:
> 
> T1 creates latch, spawns latch process, and puts it in run queue
> T1 waits on latch
> T1 ProcessManager::wait on latch see it in BOTTOM
> 
> T2 worker thread dequeues the latch process
> T2 ProcessManager::resume runs initialize, moves it to READY
> T2 ProcessManager::resume sees empty queue
> T2 ProcessManager::resume sets to BLOCKED
> 
> T3 triggers the latch, terminates the latch process
> T3 enqueue TerminateEvent
> T3 enqueue sees state BLOCKED
> T3 swaps from BLOCKED->READY & enqueues latch process into run queue
> 
> T1 extracts latch process from run queue
> T1 ProcessManager::resume sees READY
> T1 ProcessManager::resume dequeues terminate event
> T1 ProcessManager::resume calls ProcessManager::cleanup
> T1 ProcessManager::cleanup sets to TERMINATING
> T1 ProcessManager::cleanup decommissions event queue
> T1 ProcessManager::cleanup waits for latch refs to go away
> T1 ProcessManager::cleanup calls SocketManager::exited
> T1 ProcessManager::cleanup opens gate
> T1 ProcessManager::cleanup deletes the latch process
> 
> T2 ProcessManager::resume checks if event queue is empty again (crash)
> 
> Path 2:
> 
> T1 creates latch, spawns latch process, and puts it in run queue
> T1 waits on latch
> T1 ProcessManager::wait on latch see it in BOTTOM
> T1 ProcessManager::wait extracts latch process from run queue
> T1 ProcessManager::resume runs initialize, moves it to READY
> T1 ProcessManager::resume sees empty queue
> T1 ProcessManager::resume sets to BLOCKED
> 
> T3 triggers the latch, terminates the latch process
> T3 enqueue TerminateEvent
> T3 enqueue sees state BLOCKED
> T3 swaps from BLOCKED->READY & enqueues latch process into run queue
> 
> T2 worker thread dequeues the latch process
> T2 ProcessManager::resume sees READY
> T2 ProcessManager::resume dequeues terminate event
> T2 ProcessManager::resume calls ProcessManager::cleanup
> T2 ProcessManager::cleanup sets to TERMINATING
> T2 ProcessManager::cleanup decommissions event queue
> T2 ProcessManager::cleanup waits for latch refs to go away
> T2 ProcessManager::cleanup calls SocketManager::exited
> T2 ProcessManager::cleanup opens gate
> T2 ProcessManager::resume deletes the latch process
> 
> T1 ProcessManager::resume checks if event queue is empty again (crash)
> 
> 
> Diffs
> -----
> 
>   3rdparty/libprocess/src/process.cpp 4d8d483cfa5c3bac8bbe6a985f1cc2b737ae691e 
> 
> 
> Diff: https://reviews.apache.org/r/63208/diff/1/
> 
> 
> Testing
> -------
> 
> * Tested by injecting a sleep [here](https://github.com/apache/mesos/blob/1.4.0/3rdparty/libprocess/src/process.cpp#L3278).
> * Ran throughput benchmark, results are highly variable across runs, see results [here](https://docs.google.com/spreadsheets/d/1yoQtvAWiojLWBV0TlqjCxdc1RMtNERGSEZzgCxkWoxI/edit?usp=sharing).
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message