mesos-issues mailing list archives

From "Joris Van Remoortere (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-3595) Framework process hangs after master failover when number frameworks > libprocess thread pool size
Date Tue, 26 Jan 2016 08:11:39 GMT

     [ https://issues.apache.org/jira/browse/MESOS-3595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van Remoortere updated MESOS-3595:
----------------------------------------
          Sprint: Mesosphere Sprint 24, Mesosphere Sprint 27  (was: Mesosphere Sprint 24)
    Story Points: 3

> Framework process hangs after master failover when number frameworks > libprocess thread pool size
> --------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-3595
>                 URL: https://issues.apache.org/jira/browse/MESOS-3595
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>    Affects Versions: 0.24.1
>            Reporter: Mandeep Chadha
>            Assignee: Mandeep Chadha
>              Labels: mesosphere
>
> When multiple framework instances run in a single process and the number of frameworks
> created exceeds the number of libprocess worker threads, the ZooKeeper updates triggered
> by a master failover can cause a deadlock. For example, on a machine with 24 CPUs, if the
> per-process framework instance count exceeds 24, then when the master fails over, all of
> the libprocess threads block while updating the cache (GroupProcess::cache()), which leads
> to deadlock. Below is the stack trace of one of the libprocess threads:
> {code}
> Thread 101 (Thread 0x7f42821f1700 (LWP 5974)):
> #0  0x000000314100b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00007f42870d1637 in Gate::arrive(long) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #2  0x00007f42870be87c in process::ProcessManager::wait(process::UPID const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #3  0x00007f42870c25f7 in process::wait(process::UPID const&, Duration const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #4  0x00007f428708e294 in process::Latch::await(Duration const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #5  0x00007f4286b67dea in process::Future<int>::await(Duration const&) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #6  0x00007f4286b5a0df in process::Future<int>::get() const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #7  0x00007f4286ff0508 in ZooKeeper::getChildren(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #8  0x00007f4286cb394e in zookeeper::GroupProcess::cache() () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #9  0x00007f4286cb1e63 in zookeeper::GroupProcess::updated(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #10 0x00007f4286ce027a in std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>::operator()(zookeeper::GroupProcess*, long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #11 0x00007f4286ce0067 in std::tr1::result_of<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*&>)>::type, std::tr1::result_of<std::tr1::_Mu<long, false, false> ()(long, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*&>))>::type, std::tr1::result_of<std::tr1::_Mu<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, false, false> ()(std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*&>))>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::_Placeholder<1>, long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::__call<zookeeper::GroupProcess*&, 0, 1, 2>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*&>), std::tr1::_Index_tuple<0, 1, 2>) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #12 0x00007f4286cdfd16 in std::tr1::result_of<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*>)>::type, std::tr1::result_of<std::tr1::_Mu<long, false, false> ()(long, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*>))>::type, std::tr1::result_of<std::tr1::_Mu<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, false, false> ()(std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*>))>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::_Placeholder<1>, long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::operator()<zookeeper::GroupProcess*>(zookeeper::GroupProcess*&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #13 0x00007f4286cdf8be in std::tr1::_Function_handler<void ()(zookeeper::GroupProcess*), std::tr1::_Bind<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::_Placeholder<1>, long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >)> >::_M_invoke(std::tr1::_Any_data const&, zookeeper::GroupProcess*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #14 0x00007f4286cc2394 in std::tr1::function<void ()(zookeeper::GroupProcess*)>::operator()(zookeeper::GroupProcess*) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #15 0x00007f4286cbc3a2 in void process::internal::vdispatcher<zookeeper::GroupProcess>(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #16 0x00007f4286ccdca5 in std::tr1::result_of<void (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>)>::type, std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, false, false> ()(std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>))>::type))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)>::__call<process::ProcessBase*&, 0, 1>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>), std::tr1::_Index_tuple<0, 1>) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #17 0x00007f4286cc7a5a in std::tr1::result_of<void (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*>)>::type, std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, false, false> ()(std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*>))>::type))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)>::operator()<process::ProcessBase*>(process::ProcessBase*&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #18 0x00007f4286cc2480 in std::tr1::_Function_handler<void ()(process::ProcessBase*), std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)> >::_M_invoke(std::tr1::_Any_data const&, process::ProcessBase*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #19 0x00007f42870db546 in std::tr1::function<void ()(process::ProcessBase*)>::operator()(process::ProcessBase*) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #20 0x00007f42870c1013 in process::ProcessBase::visit(process::DispatchEvent const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #21 0x00007f42870c5582 in process::DispatchEvent::visit(process::EventVisitor*) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #22 0x00007f428666680e in process::ProcessBase::serve(process::Event const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #23 0x00007f42870bd88f in process::ProcessManager::resume(process::ProcessBase*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #24 0x00007f42870b1cb9 in process::schedule(void*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #25 0x00000031410079d1 in start_thread () from /lib64/libpthread.so.0
> #26 0x00000031408e88fd in clone () from /lib64/libc.so.6
> {code}
> Solution:
> Create one master detector per URL instead of one per framework.
> I will send a review request.
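
One way to picture the proposed fix is a registry that hands out a single shared detector per master URL, so that a ZooKeeper update is handled once and then fanned out to every framework in the process. This is a minimal sketch under that assumption, not the Mesos implementation: MasterDetector, DetectorRegistry, detector_for and the example URL and master address are all hypothetical.

```python
import threading


class MasterDetector:
    """Hypothetical stand-in for a ZooKeeper-backed master detector."""

    def __init__(self, url):
        self.url = url
        self._callbacks = []

    def subscribe(self, callback):
        self._callbacks.append(callback)

    def master_changed(self, master):
        # A single update is processed once, then delivered to all subscribers.
        for callback in list(self._callbacks):
            callback(master)


class DetectorRegistry:
    """Hands out one shared detector per master URL (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._detectors = {}

    def detector_for(self, url):
        with self._lock:
            if url not in self._detectors:
                self._detectors[url] = MasterDetector(url)
            return self._detectors[url]

    def __len__(self):
        with self._lock:
            return len(self._detectors)


registry = DetectorRegistry()
url = "zk://zk.example.invalid:2181/mesos"  # placeholder URL
notified = []

# 30 frameworks in one process, all pointing at the same master URL.
for framework_id in range(30):
    registry.detector_for(url).subscribe(
        lambda master, fid=framework_id: notified.append((fid, master))
    )

# Simulated failover: one detector handles the update for every framework.
registry.detector_for(url).master_changed("master@10.0.0.2:5050")
print(len(registry), "detector(s),", len(notified), "frameworks notified")
```

The number of detectors, and with it the number of concurrent ZooKeeper group updates, then depends on the number of distinct URLs rather than on the number of frameworks in the process.
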



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
