Reply-To: mesos-dev@incubator.apache.org
Date: Thu, 1 Nov 2012 23:10:14 +0000 (UTC)
From: "Benjamin Hindman (JIRA)"
To: mesos-dev@incubator.apache.org
Message-ID: <371942092.58402.1351811414526.JavaMail.jiratomcat@arcas>
In-Reply-To: <1329637047.56808.1351789512182.JavaMail.jiratomcat@arcas>
Subject: [jira] [Updated] (MESOS-303) mesos slave crashes during framework termination
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/MESOS-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Hindman updated MESOS-303:
-----------------------------------

    Attachment: copy_executor.patch

I've reproduced what appears to be this bug, and this patch fixes the issue.


> mesos slave crashes during framework termination
> ------------------------------------------------
>
>                 Key: MESOS-303
>                 URL: https://issues.apache.org/jira/browse/MESOS-303
>             Project: Mesos
>          Issue Type: Bug
>         Environment: Ubuntu 11.04
>            Reporter: Erich Nachbar
>            Priority: Critical
>         Attachments: copy_executor.patch, copy_executor.patch
>
>
> Hi,
> I'm running Spark 0.6.0 on Mesos trunk (5230fea125b0b) and am seeing my Mesos slaves terminate when a Spark job is aborted (CTRL-C).
> The logs only show a segfault message, but I obtained a backtrace through gdb to give a little more context.
> Mesos passes all checks (make check) except for the Linux container tests.
> Mesos was built using: ./configure.ubuntu-natty-64 --with-zookeeper --with-webui
> Mesos slave command: mesos-slave --master=zk://szk0:2181/mesos
> Here are the last few lines leading up to the segfault using gdb:
> 2012-10-31 22:15:35,698:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-31 22:15:39,047:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 13 ms
> 2012-10-31 22:15:42,385:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 15 ms
> I1031 22:15:45.434877 29511 slave.cpp:652] Asked to shut down framework 201210312057-1560611338-5050-24091-0009
> I1031 22:15:45.435017 29511 slave.cpp:656] Shutting down framework 201210312057-1560611338-5050-24091-0009
> I1031 22:15:45.435387 29511 slave.cpp:1102] Shutting down executor 'default' of framework 201210312057-1560611338-5050-24091-0009
> 2012-10-31 22:15:45,707:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-31 22:15:49,044:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> I1031 22:15:50.437018 29495 slave.cpp:1131] Killing executor 'default' of framework 201210312057-1560611338-5050-24091-0009
> I1031 22:15:50.439749 29502 gc.cpp:97] Scheduling /tmp/mesos/slaves/201210312057-1560611338-5050-24091-22/frameworks/201210312057-1560611338-5050-24091-0009/executors/default/runs/74aa6767-e45c-40db-8bfd-5aaf9960fabe for removal
> /usr/local/libexec/mesos/killtree.sh: line 229: echo: write error: Broken pipe
> /usr/local/libexec/mesos/killtree.sh: line 135: echo: write error: Broken pipe
> root@shd0:~/mesos_git# /usr/local/libexec/mesos/killtree.sh: line 124: printf: write error: Broken pipe
> /usr/local/libexec/mesos/killtree.sh: line 124: printf: write error: Broken pipe
> /usr/local/libexec/mesos/killtree.sh: line 229: echo: write error: Broken pipe
> -----------------------------------------------------------------------------------------------
> Here is the backtrace from gdb:
> #0  0x0000000000000000 in ?? ()
> #1  0x00007ffff74dbaf6 in mesos::internal::slave::Executor::~Executor() () from /usr/local/lib/libmesos-0.9.0.so
> #2  0x00007ffff74ec00c in __gnu_cxx::new_allocator::destroy(mesos::internal::slave::Executor*) () from /usr/local/lib/libmesos-0.9.0.so
> #3  0x00007ffff74e3bd5 in std::_List_base >::_M_clear() () from /usr/local/lib/libmesos-0.9.0.so
> #4  0x00007ffff74de3df in std::_List_base >::~_List_base() () from /usr/local/lib/libmesos-0.9.0.so
> #5  0x00007ffff74dc670 in std::list >::~list() () from /usr/local/lib/libmesos-0.9.0.so
> #6  0x00007ffff74dc7fb in mesos::internal::slave::Framework::~Framework() () from /usr/local/lib/libmesos-0.9.0.so
> #7  0x00007ffff74d87d5 in mesos::internal::slave::Slave::shutdownExecutorTimeout(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&) () from /usr/local/lib/libmesos-0.9.0.so
> #8  0x00007ffff7501313 in std::tr1::_Mem_fn::operator()(mesos::internal::slave::Slave*, mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&) const () from /usr/local/lib/libmesos-0.9.0.so
> #9  0x00007ffff74fd404 in std::tr1::result_of ()(std::tr1::result_of, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple)>::type, std::tr1::result_of ()(mesos::FrameworkID, std::tr1::_Mu, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple))>::type, std::tr1::result_of ()(mesos::ExecutorID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple))>::type, std::tr1::result_of ()(UUID, std::tr1::_Mu, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple))>::type)>::type std::tr1::_Bind ()(std::tr1::_Placeholder<1>, mesos::FrameworkID, mesos::ExecutorID, UUID)>::__call(std::tr1::_Mu, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple), std::tr1::_Index_tuple<0, 1, 2, 3>) () from /usr/local/lib/libmesos-0.9.0.so
> #10 0x00007ffff74f7956 in std::tr1::result_of ()(std::tr1::result_of, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple)>::type, std::tr1::result_of ()(mesos::FrameworkID, std::tr1::_Mu, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple))>::type, std::tr1::result_of ()(mesos::ExecutorID, std::tr1::_Mu, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple))>::type, std::tr1::result_of ()(UUID, std::tr1::_Mu, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple))>::type)>::type std::tr1::_Bind ()(std::tr1::_Placeholder<1>, mesos::FrameworkID, mesos::ExecutorID, UUID)>::operator()(mesos::internal::slave::Slave*&) () from /usr/local/lib/libmesos-0.9.0.so
> #11 0x00007ffff74f12dc in std::tr1::_Function_handler ()(std::tr1::_Placeholder<1>, mesos::FrameworkID, mesos::ExecutorID, UUID)> >::_M_invoke(std::tr1::_Any_data const&, mesos::internal::slave::Slave*) () from /usr/local/lib/libmesos-0.9.0.so
> #12 0x00007ffff74ed58a in std::tr1::function::operator()(mesos::internal::slave::Slave*) const () from /usr/local/lib/libmesos-0.9.0.so
> #13 0x00007ffff74e508d in void process::internal::vdispatcher(process::ProcessBase*, std::tr1::shared_ptr >) () from /usr/local/lib/libmesos-0.9.0.so
> #14 0x00007ffff74f9be9 in std::tr1::result_of, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple)>::type, std::tr1::result_of >, false, false> ()(std::tr1::shared_ptr >, std::tr1::_Mu, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple))>::type))(process::ProcessBase*, std::tr1::shared_ptr >)>::type std::tr1::_Bind, std::tr1::shared_ptr >))(process::ProcessBase*, std::tr1::shared_ptr >)>::__call(std::tr1::_Mu, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple), std::tr1::_Index_tuple<0, 1>) () from /usr/local/lib/libmesos-0.9.0.so
> #15 0x00007ffff74f3ce4 in std::tr1::result_of, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple)>::type, std::tr1::result_of >, false, false> ()(std::tr1::shared_ptr >, std::tr1::_Mu, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple))>::type))(process::ProcessBase*, std::tr1::shared_ptr >)>::type std::tr1::_Bind, std::tr1::shared_ptr >))(process::ProcessBase*, std::tr1::shared_ptr >)>::operator()(process::ProcessBase*&) () from /usr/local/lib/libmesos-0.9.0.so
> #16 0x00007ffff74ed676 in std::tr1::_Function_handler, std::tr1::shared_ptr >))(process::ProcessBase*, std::tr1::shared_ptr >)> >::_M_invoke(std::tr1::_Any_data const&, process::ProcessBase*) () from /usr/local/lib/libmesos-0.9.0.so
> #17 0x00007ffff76eecd0 in std::tr1::function::operator()(process::ProcessBase*) const () from /usr/local/lib/libmesos-0.9.0.so
> #18 0x00007ffff76da56b in process::ProcessBase::visit(process::DispatchEvent const&) () from /usr/local/lib/libmesos-0.9.0.so
> #19 0x00007ffff76df1a4 in process::DispatchEvent::visit(process::EventVisitor*) const () from /usr/local/lib/libmesos-0.9.0.so
> #20 0x00007ffff738a85e in process::ProcessBase::serve(process::Event const&) () from /usr/local/lib/libmesos-0.9.0.so
> #21 0x00007ffff76d7ccb in process::ProcessManager::resume(process::ProcessBase*) () from /usr/local/lib/libmesos-0.9.0.so
> #22 0x00007ffff76cf6f7 in process::schedule(void*) () from /usr/local/lib/libmesos-0.9.0.so
> #23 0x00007ffff51fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
> #24 0x00007ffff4f45fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #25 0x0000000000000000 in ?? ()
> A debugging session is active.
> I discussed the issue with Florian and did some investigation into the code. It seems that the problematic section of the code recently received a fairly major patch:
> diff --git a/src/slave/process_based_isolation_module.cpp b/src/slave/process_based_isolation_module.cpp
> index 7448326..b0b6a81 100644
> --- a/src/slave/process_based_isolation_module.cpp
> +++ b/src/slave/process_based_isolation_module.cpp
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
> +#include   // For perror.
>  #include 
>  #include 
> @@ -150,29 +151,33 @@ void ProcessBasedIsolationModule::launchExecutor(
>      dispatch(slave, &Slave::executorStarted,
>               frameworkId, executorId, pid);
>    } else {
> -    // In child process, make cleanup easier.
> +    // In child process, we make cleanup easier by putting process
> +    // into it's own session. DO NOT USE GLOG!
> +    close(pipes[0]);
> +
>      // NOTE: We setsid() in a loop because setsid() might fail if another
>      // process has the same process group id as the calling process.
> -    close(pipes[0]);
>      while ((pid = setsid()) == -1) {
> -      PLOG(ERROR) << "Could not put executor in own session, "
> -                  << "forking another process and retrying";
> +      perror("Could not put executor in own session");
> +
> +      std::cerr << "Forking another process and retrying ..." << std::endl;
>        if ((pid = fork()) == -1) {
> -        LOG(ERROR) << "Failed to fork to launch executor";
> -        exit(-1);
> +        perror("Failed to fork to launch executor");
> +        abort();
>        }
>        if (pid) {
>          // In parent process.
>          // It is ok to suicide here, though process reaper signals the exit,
>          // because the process isolation module ignores unknown processes.
> -        exit(-1);
> +        exit(0);
>        }
>      }
>      if (write(pipes[1], &pid, sizeof(pid)) != sizeof(pid)) {
> -      PLOG(FATAL) << "Failed to write PID on pipe";
> +      perror("Failed to write PID on pipe");
> +      abort();
>      }
>      close(pipes[1]);
> @@ -182,7 +187,8 @@ void ProcessBasedIsolationModule::launchExecutor(
>                                  executorInfo, directory);
>      if
> -----------------------------------------
> We have our backs against the wall a bit here: the released Mesos 0.9 requires restarting the whole cluster after a master failure (and we have had a few), which loses all running jobs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
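To make the child-side launch logic in the diff above concrete, here is a minimal, self-contained sketch of the same fork/setsid/pipe pattern. It is illustrative only, not the Mesos implementation: the parent forks the executor and reads its final PID from a pipe, while the child retries setsid() by forking again when it fails, avoids glog after fork(), and writes its PID back over the pipe.

// Illustrative sketch only (not the Mesos source): the fork/setsid/pipe
// pattern used by the patched launchExecutor path discussed above.
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>    // For perror and printf.
#include <stdlib.h>   // For abort and exit.

int main()
{
  int pipes[2];
  if (pipe(pipes) == -1) {
    perror("Failed to create pipe");
    return 1;
  }

  pid_t pid = fork();
  if (pid == -1) {
    perror("Failed to fork");
    return 1;
  }

  if (pid > 0) {
    // Parent: read back the PID of whichever (possibly re-forked) child
    // ended up leading the new session.
    close(pipes[1]);
    pid_t executor = -1;
    if (read(pipes[0], &executor, sizeof(executor)) != sizeof(executor)) {
      perror("Failed to read PID from pipe");
      return 1;
    }
    close(pipes[0]);
    printf("Executor is running in its own session with pid %d\n", (int) executor);
    return 0;
  }

  // Child: close the read end. Avoid glog (or any heavyweight logging)
  // after fork(); stick to perror/write, as the patch does.
  close(pipes[0]);

  // setsid() fails if the caller is already a process group leader, so
  // fork again and retry in the new child until it succeeds.
  while ((pid = setsid()) == -1) {
    perror("Could not put executor in own session");
    if ((pid = fork()) == -1) {
      perror("Failed to fork to launch executor");
      abort();
    }
    if (pid > 0) {
      // Intermediate parent: exit cleanly; the new child retries setsid().
      exit(0);
    }
  }

  // Report the session leader's PID (our own) to the original parent; the
  // real code would then go on to exec the executor.
  if (write(pipes[1], &pid, sizeof(pid)) != sizeof(pid)) {
    perror("Failed to write PID on pipe");
    abort();
  }
  close(pipes[1]);
  return 0;
}

The retry-by-forking trick matters because setsid() refuses to run in a process that is already a process group leader; forking once more guarantees the new child is not one, so the next setsid() call can succeed.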