mesos-issues mailing list archives

From "Benjamin Mahler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-7921) process::EventQueue sometimes crashes
Date Sun, 22 Oct 2017 01:12:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214146#comment-16214146 ]

Benjamin Mahler commented on MESOS-7921:
----------------------------------------

[~benjaminhindman] I took a look at {{process::Latch}}, knowing that it can trigger the crash
thanks to traces that [~alexr] shared with me. Here are some example crash paths.

Path 1, produces the stack trace originally shown in this ticket:

{noformat}
T1 creates latch, spawns latch process, and puts it in run queue
T1 waits on latch
T1 ProcessManager::wait on latch sees it in BOTTOM

T2 worker thread dequeues the latch process
T2 ProcessManager::resume runs initialize, moves it to READY
T2 ProcessManager::resume sees empty queue
T2 ProcessManager::resume sets to BLOCKED

T3 triggers the latch, terminates the latch process
T3 enqueue TerminateEvent
T3 enqueue sees state BLOCKED
T3 swaps from BLOCKED->READY & enqueues latch process into run queue

T1 extracts latch process from run queue
T1 ProcessManager::resume sees READY
T1 ProcessManager::resume dequeues terminate event
T1 ProcessManager::resume calls ProcessManager::cleanup
T1 ProcessManager::cleanup sets to TERMINATING
T1 ProcessManager::cleanup decommissions event queue
T1 ProcessManager::cleanup waits for latch refs to go away
T1 ProcessManager::cleanup calls SocketManager::exited
T1 ProcessManager::cleanup opens gate
T1 ProcessManager::cleanup deletes the latch process

T2 ProcessManager::resume checks if event queue is empty again (crash)
{noformat}
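
The step where T2 gives up ownership in Path 1 can be sketched in isolation. This is a minimal, single-threaded sketch under stated assumptions: {{ToyProcess}}, {{ToyRunQueue}}, {{enqueue}}, {{block_if_idle}} and {{replay_handoff}} are hypothetical stand-ins written for illustration, not the libprocess types or functions, and the toy event queue is not thread-safe.

```cpp
#include <atomic>
#include <deque>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT the libprocess types.
enum class State { BOTTOM, READY, BLOCKED, TERMINATING };

struct ToyProcess {
  std::atomic<State> state{State::BOTTOM};
  std::deque<std::string> events;  // Stand-in for the event queue.
};

struct ToyRunQueue {
  std::deque<ToyProcess*> runnable;
};

// Mirrors "enqueue sees state BLOCKED, swaps BLOCKED->READY and enqueues
// the process into the run queue". Returns true if this call made the
// process runnable again.
bool enqueue(ToyProcess& p, ToyRunQueue& q, const std::string& event) {
  p.events.push_back(event);
  State expected = State::BLOCKED;
  if (p.state.compare_exchange_strong(expected, State::READY)) {
    q.runnable.push_back(&p);
    return true;
  }
  return false;
}

// Mirrors "resume sees empty queue, sets to BLOCKED". Once the store is
// visible, the resuming thread no longer has exclusive use of the process.
void block_if_idle(ToyProcess& p) {
  if (p.events.empty()) {
    p.state.store(State::BLOCKED);
  }
}

// Replays the T2 and T3 steps in order on one thread. Returns true if
// the process ends up back in the run queue.
bool replay_handoff() {
  ToyProcess latch;
  ToyRunQueue queue;
  latch.state.store(State::READY);  // T2: initialize ran.
  block_if_idle(latch);             // T2: empty queue, so BLOCKED.
  bool scheduled = enqueue(latch, queue, "terminate");  // T3.
  // From this point any thread may take `latch` off the run queue, run
  // cleanup and delete it, while T2 may still be about to touch it.
  return scheduled && queue.runnable.size() == 1;
}
```

In this sketch the hand-off happens at the store in {{block_if_idle}}: after it, the only thing standing between another thread and deletion of the process is the run queue, which is why a later access by T2 can hit freed memory.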

Path 2, produces the alternate stack trace involving latch shown below:

{noformat}
T1 creates latch, spawns latch process, and puts it in run queue
T1 waits on latch
T1 ProcessManager::wait on latch sees it in BOTTOM
T1 ProcessManager::wait extracts latch process from run queue, calls resume
T1 ProcessManager::resume runs initialize, moves it to READY
T1 ProcessManager::resume sees empty queue
T1 ProcessManager::resume sets to BLOCKED

T3 triggers the latch, terminates the latch process
T3 enqueue TerminateEvent
T3 enqueue sees state BLOCKED
T3 swaps from BLOCKED->READY & enqueues latch process into run queue

T2 worker thread dequeues the latch process
T2 ProcessManager::resume sees READY
T2 ProcessManager::resume dequeues terminate event
T2 ProcessManager::resume calls ProcessManager::cleanup
T2 ProcessManager::cleanup sets to TERMINATING
T2 ProcessManager::cleanup decommissions event queue
T2 ProcessManager::cleanup waits for latch refs to go away
T2 ProcessManager::cleanup calls SocketManager::exited
T2 ProcessManager::cleanup opens gate
T2 ProcessManager::resume deletes the latch process

T1 ProcessManager::resume checks if event queue is empty again (crash)
{noformat}

{noformat}
*** Aborted at 1508426752 (unix time) try "date -d @1508426752" if you are using GNU date ***
PC: @     0x7fa9ed827d14 process::EventQueue::Consumer::empty()
*** SIGSEGV (@0x8) received by PID 4537 (TID 0x7fa9ef594800) from PID 8; stack trace: ***
    @     0x7fa9e4e7a390 (unknown)
    @     0x7fa9ed827d14 process::EventQueue::Consumer::empty()
    @     0x7fa9ed813195 process::ProcessManager::resume()
    @     0x7fa9ed814256 process::ProcessManager::wait()
    @     0x7fa9ed819461 process::wait()
    @     0x7fa9ed7bcf3b process::Latch::await()
    @     0x5643c8b83967 process::Future<>::await()
    @     0x7fa9ec8ff81c mesos::internal::slave::FetcherProcess::Metrics::~Metrics()
    @     0x7fa9ec8fff2e mesos::internal::slave::FetcherProcess::~FetcherProcess()
    @     0x7fa9ec8fffb2 mesos::internal::slave::FetcherProcess::~FetcherProcess()
    @     0x5643c804fedb process::Owned<>::Data::~Data()
    ...
{noformat}

One fix I can see here is for {{ProcessManager::resume}} to hold a reference to the process
while it's looping over its event queue, and to release that reference before it calls into
cleanup.
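
A rough sketch of that idea, assuming a simple counted reference: {{ToyProcess}}, {{ToyReference}}, {{safe_to_delete}} and {{resume}} below are hypothetical names for illustration only and do not reflect the actual libprocess code or any eventual patch. The toy queue is not thread-safe; the point is only the ordering of acquire, loop, release.

```cpp
#include <atomic>
#include <deque>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT the libprocess types.
struct ToyProcess {
  std::atomic<int> refs{0};
  std::deque<std::string> events;  // Stand-in for the event queue.
};

// RAII guard that counts as one outstanding reference while it is held.
class ToyReference {
 public:
  explicit ToyReference(ToyProcess* p) : process_(p) {
    if (process_ != nullptr) {
      process_->refs.fetch_add(1);
    }
  }
  ~ToyReference() { release(); }
  ToyReference(const ToyReference&) = delete;
  ToyReference& operator=(const ToyReference&) = delete;

  // Drops the reference early; safe to call more than once.
  void release() {
    if (process_ != nullptr) {
      process_->refs.fetch_sub(1);
      process_ = nullptr;
    }
  }

 private:
  ToyProcess* process_;
};

// Stand-in for "cleanup waits for refs to go away": a real cleanup would
// wait until this holds before deleting the process.
bool safe_to_delete(const ToyProcess& p) { return p.refs.load() == 0; }

// Drains events while holding a reference. Returns true if a terminate
// event was seen, in which case the caller would go on to run cleanup.
bool resume(ToyProcess& p) {
  ToyReference ref(&p);
  bool terminate = false;
  while (!p.events.empty()) {  // Guarded: deletion must wait on `ref`.
    std::string event = p.events.front();
    p.events.pop_front();
    if (event == "terminate") {
      terminate = true;
      break;
    }
  }
  ref.release();  // Release first so cleanup does not wait on this thread.
  return terminate;
}
```

With this ordering, a cleanup running on another thread would see a non-zero count for as long as the loop can still touch the event queue, and the thread that goes on to call cleanup does not end up waiting on its own reference.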

> process::EventQueue sometimes crashes
> -------------------------------------
>
>                 Key: MESOS-7921
>                 URL: https://issues.apache.org/jira/browse/MESOS-7921
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>    Affects Versions: 1.4.0
>         Environment: autotools,gcc,--verbose,GLOG_v=1 MESOS_VERBOSE=1,ubuntu:14.04,(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)
> Note that --enable-lock-free-event-queue is not enabled.
> Details: https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/injectedEnvVars/
>            Reporter: Yan Xu
>            Assignee: Benjamin Hindman
>            Priority: Blocker
>             Fix For: 1.4.0
>
>         Attachments: FetcherCacheTest.CachedCustomOutputFileWithSubdirectory.log.txt, MesosContainerizerSlaveRecoveryTest.ResourceStatisticsFullLog.txt
>
>
> The following segfault is found on [ASF|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/] in {{MesosContainerizerSlaveRecoveryTest.ResourceStatistics}} but it's flaky and shows up in other tests and environments (with or without --enable-lock-free-event-queue) as well.
> {noformat: title=Configuration}
> ./bootstrap '&&' ./configure --verbose '&&' make -j6 distcheck
> {noformat}
> {noformat:title=}
> *** Aborted at 1503937885 (unix time) try "date -d @1503937885" if you are using GNU date ***
> PC: @     0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> *** SIGSEGV (@0x8) received by PID 751 (TID 0x2b9e31978700) from PID 8; stack trace: ***
>     @     0x2b9e29d26330 (unknown)
>     @     0x2b9e2581caa0 process::EventQueue::Consumer::empty()
>     @     0x2b9e25800a40 process::ProcessManager::resume()
>     @     0x2b9e2580f891 process::ProcessManager::init_threads()::$_9::operator()()
>     @     0x2b9e2580f7d5 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_9vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
>     @     0x2b9e2580f7a5 std::_Bind_simple<>::operator()()
>     @     0x2b9e2580f77c std::thread::_Impl<>::_M_run()
>     @     0x2b9e29fe5a60 (unknown)
>     @     0x2b9e29d1e184 start_thread
>     @     0x2b9e2a851ffd (unknown)
> make[3]: *** [CMakeFiles/check] Segmentation fault (core dumped)
> {noformat}
> A builds@mesos.apache.org query shows many such instances: https://lists.apache.org/list.html?builds@mesos.apache.org:lte=1M:process%3A%3AEventQueue%3A%3AConsumer%3A%3Aempty



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
