mesos-issues mailing list archives

From "Benjamin Mahler (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-8256) Libprocess can silently deadlock due to worker thread exhaustion.
Date Tue, 19 Dec 2017 22:13:00 GMT

     [ https://issues.apache.org/jira/browse/MESOS-8256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-8256:
-----------------------------------
    Description: 
Currently, libprocess uses a fixed number of worker threads. This means that any code that
blocks a worker thread and requires another worker thread to unblock it can lead to deadlock
once enough such callers block all of the worker threads. The deadlock occurs without any
logging, and we don't expose an endpoint that would reveal it either.
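
To make the failure mode concrete, here is a minimal, self-contained C++ sketch. It is not
libprocess code: the {{FixedPool}} class and its {{submit()}} method are hypothetical
stand-ins for the libprocess run queue and worker threads. With N workers, N tasks that each
block waiting on work that can only run on some other worker exhaust the pool and hang, and
nothing is logged.

{code:cpp}
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Deliberately minimal fixed-size pool: no shutdown handling, since the
// program below never gets that far.
class FixedPool {
public:
  explicit FixedPool(size_t n) {
    for (size_t i = 0; i < n; ++i) {
      workers_.emplace_back([this] {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return !tasks_.empty(); });
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();
        }
      });
    }
  }

  void submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

private:
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mutex_;
  std::condition_variable cv_;
};

int main() {
  const size_t N = 4;  // Fixed worker count, as in libprocess.
  FixedPool pool(N);

  // Each of these N tasks occupies a worker and blocks on a result that only
  // another queued task can produce. The producer tasks sit behind the
  // blockers in the queue and never run: every worker is blocked, nothing
  // makes progress, and nothing is logged.
  std::vector<std::future<void>> done;
  for (size_t i = 0; i < N; ++i) {
    auto finished = std::make_shared<std::promise<void>>();
    done.push_back(finished->get_future());
    pool.submit([&pool, finished] {
      auto inner = std::make_shared<std::promise<int>>();
      std::future<int> result = inner->get_future();
      pool.submit([inner] { inner->set_value(42); });  // Never scheduled.
      result.wait();          // Blocks this worker forever: deadlock.
      finished->set_value();  // Unreachable.
    });
  }

  for (auto& f : done) {
    f.wait();  // Hangs here; no log message, no endpoint, no indication.
  }
}
{code}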

Our current approach to avoiding this issue is to (1) forbid blocking a worker thread, and
(2) set the worker thread pool minimum size to a known safe value. Approach (1) is hard to
enforce: there is a lot of blocking code using {{process::wait}} (the alternative is to spawn
a managed process), and other code still blocks in other ways (such as {{ZooKeeper}} calls or
custom module code; this code could be fixed to be non-blocking). Approach (2) is brittle:
we cannot easily determine the minimum safe number as the code evolves and as users run
module code.
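
As an illustration of what the "forbid blocking a worker thread" rule means in practice, here
is a hedged sketch in plain C++ (not the libprocess API; {{fetchConfig}} is a hypothetical
asynchronous operation standing in for something like a ZooKeeper lookup): instead of waiting
synchronously for a result, the caller registers a continuation and returns immediately, so
the worker stays available for other queued events.

{code:cpp}
#include <chrono>
#include <functional>
#include <iostream>
#include <string>
#include <thread>

// Hypothetical asynchronous operation: completes on some other thread and
// invokes `done` with the result. In real code this would be a dispatch or
// a future with a continuation attached.
void fetchConfig(std::function<void(const std::string&)> done) {
  std::thread([done] { done("quorum=2"); }).detach();
}

int main() {
  // Anti-pattern (what blocking via process::wait-style calls amounts to):
  //
  //   std::string config = fetchConfigBlocking();  // hypothetical; the
  //                                                // worker sits idle here
  //
  // If enough workers do this at once, none is left to produce the results
  // they are all waiting for.

  // Preferred: hand off a continuation and return. The worker goes straight
  // back to processing other queued events.
  fetchConfig([](const std::string& config) {
    std::cout << "config ready: " << config << std::endl;
  });

  // Only needed in this toy example, to give the detached thread time to run.
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
{code}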

Ideally:

(1) We could indicate that the deadlock has occurred via a log message and possibly an
endpoint, or even by crashing! Ideally, the user can see all of the stack traces to know why
the deadlock occurred.

(2) Libprocess could keep a dynamically sized worker pool. At the very least, we could detect
the deadlock and spawn additional threads to get out of it, removing those threads later.
Perhaps a simpler approach is to spawn another temporary worker any time a particular worker
is about to block (or, equivalently, decommission the worker that's about to block and have
the newly spawned worker take its place).
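
Here is a hedged sketch, again in plain C++ rather than libprocess internals, of the
"decommission the worker that's about to block and spawn a replacement" idea from (2). The
{{ElasticPool}} class and its {{aboutToBlock()}} hook are hypothetical names; the point is
only that announcing an imminent block and adding a worker preserves the pool's ability to
make progress. A real implementation would also retire surplus workers once the blocked one
resumes, and the same hook would be a natural place to emit the log message or metric
described in (1).

{code:cpp}
#include <condition_variable>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

class ElasticPool {
public:
  explicit ElasticPool(size_t n) {
    for (size_t i = 0; i < n; ++i) {
      addWorker();
    }
  }

  void submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

  // Called by a worker immediately before it blocks: a fresh worker takes
  // its place so the pool can keep making progress. A real implementation
  // would retire the extra worker once the blocked one resumes.
  void aboutToBlock() {
    addWorker();
  }

private:
  void addWorker() {
    // Detached to keep the sketch short; a real pool would track and join
    // these replacement threads.
    std::thread([this] {
      for (;;) {
        std::function<void()> task;
        {
          std::unique_lock<std::mutex> lock(mutex_);
          cv_.wait(lock, [this] { return !tasks_.empty(); });
          task = std::move(tasks_.front());
          tasks_.pop();
        }
        task();
      }
    }).detach();
  }

  std::queue<std::function<void()>> tasks_;
  std::mutex mutex_;
  std::condition_variable cv_;
};

int main() {
  // Leaked on purpose in this sketch so the detached workers never outlive
  // the pool they reference.
  auto* pool = new ElasticPool(1);  // Even a single worker can no longer
                                    // deadlock itself.

  std::promise<int> inner;
  std::future<int> result = inner.get_future();
  std::promise<void> done;

  pool->submit([&] {
    pool->submit([&inner] { inner.set_value(42); });
    pool->aboutToBlock();                    // Spawn a replacement first.
    std::cout << result.get() << std::endl;  // Safe: the replacement worker
                                             // runs the producer task.
    done.set_value();
  });

  done.get_future().wait();  // Completes instead of hanging.
}
{code}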

  was:
Currently, libprocess uses a fixed number of worker threads. This means that any code that
blocks a worker thread and requires another worker thread to unblock it can lead to deadlock
once enough such callers block all of the worker threads. The deadlock occurs without any
logging, and we don't expose an endpoint that would reveal it either.

Our current approach to avoiding this issue is to (1) forbid blocking a worker thread, and
(2) set the worker thread pool minimum size to a known safe value. Approach (1) is hard to
enforce: there is a lot of blocking code using {{process::wait}} (the alternative is to spawn
a managed process), and other code still blocks in other ways (such as {{ZooKeeper}} calls or
custom module code; this code could be fixed to be non-blocking). Approach (2) is brittle:
we cannot easily determine the minimum safe number as the code evolves and as users run
module code.

Ideally:

(1) We could indicate that the deadlock has occurred via a log message and possibly an
endpoint, or even by crashing! Ideally, the user can see all of the stack traces to know why
the deadlock occurred.

(2) Libprocess could keep a dynamically sized worker pool. At the very least, we could detect
the deadlock and spawn additional threads to get out of it, removing those threads later.
Perhaps a simpler approach is to spawn another worker any time a particular worker is about
to block.


> Libprocess can silently deadlock due to worker thread exhaustion.
> -----------------------------------------------------------------
>
>                 Key: MESOS-8256
>                 URL: https://issues.apache.org/jira/browse/MESOS-8256
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>            Reporter: Benjamin Mahler
>            Priority: Critical
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
