Date: Fri, 17 Feb 2017 18:00:45 +0000 (UTC)
From: "James Peach (JIRA)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Commented] (MESOS-7122) Process reaper should have a dedicated thread to avoid deadlock.

    [ https://issues.apache.org/jira/browse/MESOS-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872216#comment-15872216 ]

James Peach commented on MESOS-7122:
------------------------------------

While I agree that blocking should be avoided, the point of this bug is that it is possible for the reaper to not reap. The reaper has to be able to reliably reap so that forward progress can be made in the unfortunate event of code blocking on subprocesses.

Running a separate thread for each {{waitpid}} seems expensive but would work.
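For illustration only (this is not libprocess code, and {{reapOnDedicatedThread}} is a made-up helper), a minimal sketch of blocking in {{waitpid}} on a dedicated thread per child, so that a caller blocking on the result cannot starve the reaping itself:

{noformat}
// Sketch only: a stand-in for the thread-per-waitpid idea, not libprocess code.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cerrno>
#include <future>
#include <iostream>

// Hypothetical helper: block in waitpid() on its own thread and hand the
// exit status back through a std::future, so a caller that blocks on the
// future cannot prevent the reaping from happening.
std::future<int> reapOnDedicatedThread(pid_t pid)
{
  return std::async(std::launch::async, [pid]() {
    int status = 0;
    while (::waitpid(pid, &status, 0) == -1 && errno == EINTR) {}
    return status;
  });
}

int main()
{
  pid_t pid = ::fork();
  if (pid == 0) {
    // Child: stands in for a subprocess such as the 'hadoop' client.
    ::execlp("sleep", "sleep", "1", (char*) nullptr);
    _exit(127);
  }

  std::future<int> status = reapOnDedicatedThread(pid);

  // Blocking here is safe: the waitpid() runs on its own thread.
  std::cout << "child exited with status " << status.get() << std::endl;

  return 0;
}
{noformat}

The cost is one blocked thread per outstanding subprocess, which is the expense noted above.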
You could probably also implement this by having an event loop in {{kevent}} to monitor the PIDs directly, or by using {{signalfd}} on Linux to intercept {{SIGCHLD}} and reap any registered PIDs (a rough sketch of the {{signalfd}} approach is at the end of this message).


> Process reaper should have a dedicated thread to avoid deadlock.
> -----------------------------------------------------------------
>
> Key: MESOS-7122
> URL: https://issues.apache.org/jira/browse/MESOS-7122
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Reporter: James Peach
>
> In a test environment, we saw that libprocess can deadlock when the process reaper is unable to run.
> This happens in the Mesos HDFS client, which synchronously runs a {{hadoop}} subprocess. If this happens too many times, the {{ReaperProcess}} is never scheduled to reap the subprocess statuses. Since the HDFS {{Future}} never completes, we deadlock with all the threads in the call stack below. If there was a dedicated thread for the {{ReaperProcess}} to run on, or some other way to ensure that it is scheduled, we could avoid the deadlock.
> {noformat}
> #0 0x00007f67b6ffc68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x00007f67b6da12fc in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
> #2 0x00007f67b8b864f6 in process::ProcessManager::wait(process::UPID const&) () from /usr/lib64/libmesos-1.2.0.so
> #3 0x00007f67b8b8d347 in process::wait(process::UPID const&, Duration const&) () from /usr/lib64/libmesos-1.2.0.so
> #4 0x00007f67b8b51a85 in process::Latch::await(Duration const&) () from /usr/lib64/libmesos-1.2.0.so
> #5 0x00007f67b834fc9f in process::Future::await(Duration const&) const () from /usr/lib64/libmesos-1.2.0.so
> #6 0x00007f67b833d700 in mesos::internal::slave::fetchSize(std::basic_string, std::allocator > const&, Option, std::allocator > > const&) () from /usr/lib64/libmesos-1.2.0.so
> #7 0x00007f67b833df5e in std::result_of, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2} ()()>::type process::AsyncExecutorProcess::execute, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2}>(std::result_of const&, boost::disable_if, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2} ()()> >, void>::type*) () from /usr/lib64/libmesos-1.2.0.so
> #8 0x00007f67b833a3d5 in std::_Function_handler > process::dispatch, process::AsyncExecutorProcess, mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&, mesos::CommandInfo const&, std::basic_string, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2} const&, void*, {lambda()#2}, mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&, mesos::CommandInfo const&, std::basic_string, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2} const&>(process::PID const&, process::Future(process::PID::*)(mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&, mesos::CommandInfo const&, std::basic_string, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2} const&, void*), {lambda()#2}, mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&, mesos::CommandInfo const&, std::basic_string, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2} const&)::{lambda(process::ProcessBase*)#1}>::_M_invoke(std::_Any_data const&, process::ProcessBase*) () from /usr/lib64/libmesos-1.2.0.so
> #9 0x00007f67b8b85ede in process::ProcessManager::resume(process::ProcessBase*) () from /usr/lib64/libmesos-1.2.0.so
> #10 0x00007f67b8b8fc8f in std::thread::_Impl >::_M_run() () from /usr/lib64/libmesos-1.2.0.so
> #11 0x00007f67b6da1470 in ?? () from /usr/lib64/libstdc++.so.6
> #12 0x00007f67b6ff8aa1 in start_thread () from /lib64/libpthread.so.0
> #13 0x00007f67b6a3faad in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
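Returning to the {{signalfd}} suggestion in the comment above: a minimal sketch, assuming Linux and written well outside libprocess, of intercepting {{SIGCHLD}} via {{signalfd}} and reaping a set of registered PIDs with non-blocking {{waitpid}}:

{noformat}
// Sketch only: assumes Linux (signalfd); not libprocess code.
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <signal.h>

#include <cstdio>
#include <set>

int main()
{
  // Block SIGCHLD so it is delivered through the signalfd rather than an
  // asynchronous signal handler.
  sigset_t mask;
  sigemptyset(&mask);
  sigaddset(&mask, SIGCHLD);
  sigprocmask(SIG_BLOCK, &mask, nullptr);

  int sfd = signalfd(-1, &mask, 0);
  if (sfd == -1) {
    perror("signalfd");
    return 1;
  }

  // The "registered" PIDs this loop is responsible for reaping; here a
  // single child stands in for a subprocess such as the 'hadoop' client.
  std::set<pid_t> registered;

  pid_t child = fork();
  if (child == 0) {
    execlp("sleep", "sleep", "1", (char*) nullptr);
    _exit(127);
  }
  registered.insert(child);

  while (!registered.empty()) {
    struct signalfd_siginfo info;
    if (read(sfd, &info, sizeof(info)) != (ssize_t) sizeof(info)) {
      continue;  // Interrupted or short read; try again.
    }

    // SIGCHLD coalesces, so reap everything that is reapable right now.
    int status = 0;
    pid_t pid;
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
      if (registered.erase(pid) > 0) {
        printf("reaped pid %d with status %d\n", (int) pid, status);
      }
    }
  }

  close(sfd);
  return 0;
}
{noformat}

Because {{SIGCHLD}} notifications coalesce, the inner {{waitpid(-1, ..., WNOHANG)}} loop drains every currently reapable child on each wakeup; a {{kevent}}-based variant would instead register each PID with {{EVFILT_PROC}} and {{NOTE_EXIT}}.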