flink-dev mailing list archives

From "Ufuk Celebi (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (FLINK-469) LocalDistributedExecutor Deadlock with Low Buffer Count
Date Wed, 18 Jun 2014 23:20:24 GMT

     [ https://issues.apache.org/jira/browse/FLINK-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Ufuk Celebi resolved FLINK-469.

       Resolution: Fixed
    Fix Version/s:     (was: pre-apache)

Fixed in [2db78a8dc1a4664f3e384005d7e07bea594b835b|https://github.com/apache/incubator-flink/commit/2db78a8dc1a4664f3e384005d7e07bea594b835b].

> LocalDistributedExecutor Deadlock with Low Buffer Count
> -------------------------------------------------------
>                 Key: FLINK-469
>                 URL: https://issues.apache.org/jira/browse/FLINK-469
>             Project: Flink
>          Issue Type: Bug
>            Reporter: GitHub Import
>              Labels: github-import
> I'm currently working on ([#25|https://github.com/stratosphere/stratosphere/issues/25]
| [FLINK-25|https://issues.apache.org/jira/browse/FLINK-25]) and discovered a possible deadlock
in the network stack, caused by the buffer management in combination with the `LocalDistributedExecutor` (LDE).
> The LDE starts a JobManager and multiple TaskManagers on different network ports in a
single VM. Every TaskManager has an associated `ByteBufferedChannelManager` (single instance)
and `GlobalBufferPool` (singleton) for data transfers. When tasks get registered with a TaskManager
(which is atomic per TaskManager), the ChannelManager ensures that there are enough network
buffers available to execute the task -- this means that there has to be at least one buffer
per task channel. If this condition does not hold, an exception is thrown and the task fails.
This decision is made locally per task and not for the whole plan, e.g. for WordCount it is
possible that all map tasks get enough buffers, but a following reduce throws an exception
at runtime.
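> A minimal sketch of the registration check described above. `ChannelManagerSketch`, `registerTask`, and the buffer and channel counts are hypothetical names and numbers chosen for illustration, not the actual `ByteBufferedChannelManager` API. It assumes the check counts all channels registered so far against a fixed buffer total.

```java
/**
 * Hypothetical stand-in for a per-TaskManager channel manager.
 * Names and numbers are illustrative; this is not the Stratosphere/Flink class.
 */
final class ChannelManagerSketch {
    private final int totalBuffers;      // buffers this manager assumes it may use
    private int registeredChannels = 0;  // channels of tasks registered so far

    ChannelManagerSketch(int totalBuffers) {
        this.totalBuffers = totalBuffers;
    }

    /** Registers a task atomically; throws if one buffer per channel cannot be guaranteed. */
    synchronized void registerTask(String taskName, int channels) {
        int required = registeredChannels + channels;
        if (required > totalBuffers) {
            throw new IllegalStateException(
                "Insufficient buffers for " + taskName + ": need " + required
                    + ", have " + totalBuffers);
        }
        registeredChannels = required;
    }

    synchronized int registeredChannels() {
        return registeredChannels;
    }

    public static void main(String[] args) {
        ChannelManagerSketch manager = new ChannelManagerSketch(20);
        manager.registerTask("map", 12);        // succeeds: 12 <= 20
        try {
            manager.registerTask("reduce", 12); // fails: 24 > 20
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

> In this sketch the map task registers successfully and the later reduce task is rejected, which mirrors the WordCount case: the check is made per task as it arrives, not once for the whole plan.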
> The problem occurs in combination with the LDE: we have multiple TMs, each with its own ChannelManager
instance, but only a singleton GlobalBufferPool. This breaks the available-buffer
computation, because each TM just considers its local channels (registered at its
ChannelManager) and not the channels of other TMs (which is perfectly fine in a real distributed
setup). Therefore, it is possible for tasks to deadlock because of missing buffers (buffer
requests are blocking).
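> A minimal sketch of how the local-only check can pass while the shared pool is oversubscribed. All names (`SharedPoolSketch`, `GlobalPool`, `LocalManager`) and the numbers are hypothetical; a `Semaphore` stands in for the singleton buffer pool, and `tryAcquire` is used so the sketch reports the shortfall instead of blocking the way real buffer requests would.

```java
import java.util.concurrent.Semaphore;

final class SharedPoolSketch {

    /** Hypothetical pool shared by every manager in the same JVM. */
    static final class GlobalPool {
        final int size;
        final Semaphore permits;

        GlobalPool(int size) {
            this.size = size;
            this.permits = new Semaphore(size);
        }
    }

    /** Hypothetical per-TaskManager manager that only knows its own channels. */
    static final class LocalManager {
        private final GlobalPool pool;
        private int localChannels = 0;

        LocalManager(GlobalPool pool) {
            this.pool = pool;
        }

        /** Local-only check: compares this manager's channels against the whole pool. */
        boolean register(int channels) {
            if (localChannels + channels > pool.size) {
                return false;
            }
            localChannels += channels;
            return true;
        }

        /** Requests one buffer per local channel without blocking; returns how many were granted. */
        int acquireAll() {
            int granted = 0;
            for (int i = 0; i < localChannels; i++) {
                if (pool.permits.tryAcquire()) {
                    granted++;
                }
            }
            return granted;
        }
    }

    public static void main(String[] args) {
        GlobalPool pool = new GlobalPool(20);
        LocalManager tm1 = new LocalManager(pool);
        LocalManager tm2 = new LocalManager(pool);

        // Both checks pass, although 15 + 15 channels exceed the 20 shared buffers.
        System.out.println("tm1 registered: " + tm1.register(15));
        System.out.println("tm2 registered: " + tm2.register(15));

        // The second manager cannot get a buffer for every channel.
        System.out.println("tm1 granted: " + tm1.acquireAll());
        System.out.println("tm2 granted: " + tm2.acquireAll());
    }
}
```

> With blocking requests, the channels that go without a buffer in this sketch would wait indefinitely if the buffers they need are held by tasks that are themselves waiting.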
> You can likely reproduce this problem by running `LocalDistributedExecutorTest` and
setting the number of buffers to 20 and the buffer size to 4096 bytes (see `ConfigConstants`;
also make sure to set `multicastEnabled` in ByteBufferedChannelManager to `false`, because
it influences the computation -- multicast does not work anyway).
> I will fix this with the upcoming PR for ([#25|https://github.com/stratosphere/stratosphere/issues/25]
| [FLINK-25|https://issues.apache.org/jira/browse/FLINK-25]).
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/469
> Created by: [uce|https://github.com/uce]
> Labels: bug, runtime
> Assignee: [uce|https://github.com/uce]
> Created at: Wed Feb 12 13:58:36 CET 2014
> State: open

This message was sent by Atlassian JIRA
