Date: Fri, 17 Feb 2017 18:00:45 +0000 (UTC)
From: "James Peach (JIRA)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Commented] (MESOS-7122) Process reaper should have a dedicated thread to avoid deadlock.

    [ https://issues.apache.org/jira/browse/MESOS-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872216#comment-15872216 ]

James Peach commented on MESOS-7122:
------------------------------------

While I agree that blocking should be avoided, the point of this bug is that it is possible for the reaper to not reap. The reaper has to be able to reliably reap so that forward progress can be made in the unfortunate event of code blocking on subprocesses.

Running a separate thread for each {{waitpid}} seems expensive but would work.
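For illustration only (this is not libprocess code, and {{reapOnDedicatedThread}} is a made-up helper), a minimal sketch of blocking in {{waitpid}} on a dedicated thread per child, so that a caller blocking on the result cannot starve the reaping itself:

{noformat}
// Sketch only: a stand-in for the thread-per-waitpid idea, not libprocess code.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cerrno>
#include <future>
#include <iostream>

// Hypothetical helper: block in waitpid() on its own thread and hand the
// exit status back through a std::future, so a caller that blocks on the
// future cannot prevent the reaping from happening.
std::future<int> reapOnDedicatedThread(pid_t pid)
{
  return std::async(std::launch::async, [pid]() {
    int status = 0;
    while (::waitpid(pid, &status, 0) == -1 && errno == EINTR) {}
    return status;
  });
}

int main()
{
  pid_t pid = ::fork();
  if (pid == 0) {
    // Child: stands in for a subprocess such as the 'hadoop' client.
    ::execlp("sleep", "sleep", "1", (char*) nullptr);
    _exit(127);
  }

  std::future<int> status = reapOnDedicatedThread(pid);

  // Blocking here is safe: the waitpid() runs on its own thread.
  std::cout << "child exited with status " << status.get() << std::endl;

  return 0;
}
{noformat}

The cost is one blocked thread per outstanding subprocess, which is the expense noted above.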
You could probably also implement this by having an event loop in {{kevent}} to monitor the PIDs directly, or by using {{signalfd}} on Linux to intercept {{SIGCHLD}} and reap any registered PIDs (a rough sketch of the {{signalfd}} approach is at the end of this message).


> Process reaper should have a dedicated thread to avoid deadlock.
> -----------------------------------------------------------------
>
> Key: MESOS-7122
> URL: https://issues.apache.org/jira/browse/MESOS-7122
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Reporter: James Peach
>
> In a test environment, we saw that libprocess can deadlock when the process reaper is unable to run.
> This happens in the Mesos HDFS client, which synchronously runs a {{hadoop}} subprocess. If this happens too many times, the {{ReaperProcess}} is never scheduled to reap the subprocess statuses. Since the HDFS {{Future}} never completes, we deadlock with all the threads in the call stack below. If there was a dedicated thread for the {{ReaperProcess}} to run on, or some other way to ensure that it is scheduled, we could avoid the deadlock.
> {noformat}
> #0 0x00007f67b6ffc68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x00007f67b6da12fc in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
> #2 0x00007f67b8b864f6 in process::ProcessManager::wait(process::UPID const&) () from /usr/lib64/libmesos-1.2.0.so
> #3 0x00007f67b8b8d347 in process::wait(process::UPID const&, Duration const&) () from /usr/lib64/libmesos-1.2.0.so
> #4 0x00007f67b8b51a85 in process::Latch::await(Duration const&) () from /usr/lib64/libmesos-1.2.0.so
> #5 0x00007f67b834fc9f in process::Future::await(Duration const&) const () from /usr/lib64/libmesos-1.2.0.so
> #6 0x00007f67b833d700 in mesos::internal::slave::fetchSize(std::basic_string, std::allocator > const&, Option, std::allocator > > const&) () from /usr/lib64/libmesos-1.2.0.so
> #7 0x00007f67b833df5e in std::result_of, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2} ()()>::type process::AsyncExecutorProcess::execute, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2}>(std::result_of const&, boost::disable_if, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2} ()()> >, void>::type*) () from /usr/lib64/libmesos-1.2.0.so
> #8 0x00007f67b833a3d5 in std::_Function_handler > process::dispatch, process::AsyncExecutorProcess, mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&, mesos::CommandInfo const&, std::basic_string, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2} const&, void*, {lambda()#2}, mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&, mesos::CommandInfo const&, std::basic_string, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2} const&>(process::PID const&, process::Future(process::PID::*)(mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&, mesos::CommandInfo const&, std::basic_string, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2} const&, void*), {lambda()#2}, mesos::internal::slave::FetcherProcess::fetch(mesos::ContainerID const&, mesos::CommandInfo const&, std::basic_string, std::allocator > const&, Option, std::allocator > > const&, mesos::SlaveID const&, mesos::internal::slave::Flags const&)::{lambda()#2} const&)::{lambda(process::ProcessBase*)#1}>::_M_invoke(std::_Any_data const&, process::ProcessBase*) () from /usr/lib64/libmesos-1.2.0.so
> #9 0x00007f67b8b85ede in process::ProcessManager::resume(process::ProcessBase*) () from /usr/lib64/libmesos-1.2.0.so
> #10 0x00007f67b8b8fc8f in std::thread::_Impl >::_M_run() () from /usr/lib64/libmesos-1.2.0.so
> #11 0x00007f67b6da1470 in ?? () from /usr/lib64/libstdc++.so.6
> #12 0x00007f67b6ff8aa1 in start_thread () from /lib64/libpthread.so.0
> #13 0x00007f67b6a3faad in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
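Returning to the {{signalfd}} suggestion in the comment above: a minimal sketch, assuming Linux and written well outside libprocess, of intercepting {{SIGCHLD}} via {{signalfd}} and reaping a set of registered PIDs with non-blocking {{waitpid}}:

{noformat}
// Sketch only: assumes Linux (signalfd); not libprocess code.
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <signal.h>

#include <cstdio>
#include <set>

int main()
{
  // Block SIGCHLD so it is delivered through the signalfd rather than an
  // asynchronous signal handler.
  sigset_t mask;
  sigemptyset(&mask);
  sigaddset(&mask, SIGCHLD);
  sigprocmask(SIG_BLOCK, &mask, nullptr);

  int sfd = signalfd(-1, &mask, 0);
  if (sfd == -1) {
    perror("signalfd");
    return 1;
  }

  // The "registered" PIDs this loop is responsible for reaping; here a
  // single child stands in for a subprocess such as the 'hadoop' client.
  std::set<pid_t> registered;

  pid_t child = fork();
  if (child == 0) {
    execlp("sleep", "sleep", "1", (char*) nullptr);
    _exit(127);
  }
  registered.insert(child);

  while (!registered.empty()) {
    struct signalfd_siginfo info;
    if (read(sfd, &info, sizeof(info)) != (ssize_t) sizeof(info)) {
      continue;  // Interrupted or short read; try again.
    }

    // SIGCHLD coalesces, so reap everything that is reapable right now.
    int status = 0;
    pid_t pid;
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
      if (registered.erase(pid) > 0) {
        printf("reaped pid %d with status %d\n", (int) pid, status);
      }
    }
  }

  close(sfd);
  return 0;
}
{noformat}

Because {{SIGCHLD}} notifications coalesce, the inner {{waitpid(-1, ..., WNOHANG)}} loop drains every currently reapable child on each wakeup; a {{kevent}}-based variant would instead register each PID with {{EVFILT_PROC}} and {{NOTE_EXIT}}.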