Date: Tue, 26 Jan 2016 08:11:39 +0000 (UTC)
From: "Joris Van Remoortere (JIRA)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Updated] (MESOS-3595) Framework process hangs after master failover when number frameworks > libprocess thread pool size

     [ https://issues.apache.org/jira/browse/MESOS-3595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van Remoortere updated MESOS-3595:
----------------------------------------
    Sprint: Mesosphere Sprint 24, Mesosphere Sprint 27  (was: Mesosphere Sprint 24)
    Story Points: 3

> Framework process hangs after master failover when number frameworks > libprocess thread pool size
> --------------------------------------------------------------------------------------------------
>
> Key: MESOS-3595
> URL: https://issues.apache.org/jira/browse/MESOS-3595
> Project: Mesos
> Issue Type: Bug
> Components: scheduler driver
> Affects Versions: 0.24.1
> Reporter: Mandeep Chadha
> Assignee: Mandeep Chadha
> Labels: mesosphere
>
> When running multiple framework instances per process, if the number of frameworks created exceeds the number of libprocess threads, then during master failover the ZooKeeper updates can cause a deadlock. E.g., on a machine with 24 CPUs, if the framework instance count exceeds 24 (per process), then when the master fails over, all the libprocess threads block updating the cache (GroupProcess), leading to deadlock. Below is the stack trace of one of the libprocess threads:
> {code}
> Thread 101 (Thread 0x7f42821f1700 (LWP 5974)):
> #0 0x000000314100b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x00007f42870d1637 in Gate::arrive(long) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #2 0x00007f42870be87c in process::ProcessManager::wait(process::UPID const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #3 0x00007f42870c25f7 in process::wait(process::UPID const&, Duration const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #4 0x00007f428708e294 in process::Latch::await(Duration const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #5 0x00007f4286b67dea in process::Future::await(Duration const&) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #6 0x00007f4286b5a0df in process::Future::get() const () from
/Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_me > sos.so > #7 0x00007f4286ff0508 in ZooKeeper::getChildren(std::basic_string, std::allocator > const&, bool, std::vector r_traits, std::allocator >, std::allocator, std::allocator > > >*) () from /Users/mchadha/venv/lib/python2.7/site > -packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so > #8 0x00007f4286cb394e in zookeeper::GroupProcess::cache() () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mes > os.so > #9 0x00007f4286cb1e63 in zookeeper::GroupProcess::updated(long, std::basic_string, std::allocator > const&) () from /Users/mchadha/venv/lib/py > thon2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so > #10 0x00007f4286ce027a in std::tr1::_Mem_fn, std::allocator > const&)>::operator()(zo > okeeper::GroupProcess*, long, std::basic_string, std::allocator > const&) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.n > ative-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so > #11 0x00007f4286ce0067 in std::tr1::result_of, std::allocator > con > st&)> ()(std::tr1::result_of, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple)>::type, std::tr1::res > ult_of ()(long, std::tr1::_Mu, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple)) > >::type, std::tr1::result_of, std::allocator >, false, false> ()(std::basic_string > , std::allocator >, std::tr1::_Mu, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple))>::type)>::type std::tr1 > ::_Bind, std::allocator > const&)> ()(std::tr1::_Placeholder<1>, lo > ng, std::basic_string, std::allocator >)>::__call(std::tr1::_Mu, false, true> ( c > onst&)(std::tr1::_Placeholder<1>, std::tr1::tuple), std::tr1::_Index_tuple<0, 1, 2>) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.nati > 
ve-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so > #12 0x00007f4286cdfd16 in std::tr1::result_of, std::allocator > con > st&)> ()(std::tr1::result_of, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple)>::type, std::tr1::resu > lt_of ()(long, std::tr1::_Mu, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple))>: > :type, std::tr1::result_of, std::allocator >, false, false> ()(std::basic_string, > std::allocator >, std::tr1::_Mu, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple))>::type)>::type std::tr1::_ > Bind, std::allocator > const&)> ()(std::tr1::_Placeholder<1>, long, > std::basic_string, std::allocator >)>::operator()(zookeeper::GroupProcess*&) () from /Users/mchadha/venv/lib/python2 > .7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so > #13 0x00007f4286cdf8be in std::tr1::_Function_handler ng, std::allocator > const&)> ()(std::tr1::_Placeholder<1>, long, std::basic_string, std::allocator >)> >::_ > M_invoke(std::tr1::_Any_data const&, zookeeper::GroupProcess*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/ > _mesos.so > #14 0x00007f4286cc2394 in std::tr1::function::operator()(zookeeper::GroupProcess*) const () from /Users/mchadha/venv/lib/python2.7/site-package > s/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so > #15 0x00007f4286cbc3a2 in void process::internal::vdispatcher(process::ProcessBase*, std::tr1::shared_ptr ess*)> >) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so > #16 0x00007f4286ccdca5 in std::tr1::result_of, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple ocess::ProcessBase*&>)>::type, std::tr1::result_of >, false, false> ()(std::tr1::shared_p > tr >, std::tr1::_Mu, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple se*&>))>::type))(process::ProcessBase*, std::tr1::shared_ptr >)>::type 
std::tr1::_Bind, > std::tr1::shared_ptr >))(process::ProcessBase*, std::tr1::shared_ptr > > )>::__call(std::tr1::_Mu, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple), std: > :tr1::_Index_tuple<0, 1>) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so > #17 0x00007f4286cc7a5a in std::tr1::result_of, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple ocess::ProcessBase*>)>::type, std::tr1::result_of >, false, false> ()(std::tr1::shared_pt > r >, std::tr1::_Mu, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple e*>))>::type))(process::ProcessBase*, std::tr1::shared_ptr >)>::type std::tr1::_Bind, st > d::tr1::shared_ptr >))(process::ProcessBase*, std::tr1::shared_ptr >)> > ::operator()(process::ProcessBase*&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_me > sos.so > #18 0x00007f4286cc2480 in std::tr1::_Function_handler, std::tr1::shared_ptr >))(process::ProcessBase*, std::tr1::shared_ptr >)> >::_M_invoke(std::tr1::_Any_data con > st&, process::ProcessBase*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so > #19 0x00007f42870db546 in std::tr1::function::operator()(process::ProcessBase*) const () from /Users/mchadha/venv/lib/python2.7/site-packages/meso > s.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so > #20 0x00007f42870c1013 in process::ProcessBase::visit(process::DispatchEvent const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x8 > 6_64.egg/mesos/native/_mesos.so > #21 0x00007f42870c5582 in process::DispatchEvent::visit(process::EventVisitor*) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x > 86_64.egg/mesos/native/_mesos.so > #22 0x00007f428666680e in process::ProcessBase::serve(process::Event const&) 
() from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #23 0x00007f42870bd88f in process::ProcessManager::resume(process::ProcessBase*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #24 0x00007f42870b1cb9 in process::schedule(void*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #25 0x00000031410079d1 in start_thread () from /lib64/libpthread.so.0
> #26 0x00000031408e88fd in clone () from /lib64/libc.so.6
> {code}
> Solution:
> Create a master detector per URL instead of per framework.
> I will send a review request.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)