flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephan Ewen <se...@apache.org>
Subject Proposal: Refactor distributed coordination to use the Akka Actor Library
Date Fri, 05 Sep 2014 11:34:31 GMT
This proposes to refactor the RPC service and the coordination between
Client, JobManager, and TaskManager to use the Akka actor library.

Even though Akka is written in Scala, it offers a Java interface and we can
use Akka completely from Java.

Below are a list of arguments why this would help the system:

Problems with the current RPC service:

  - No asynchronous calls with callbacks. This is the reason why several
parts of the runtime poll the status, introducing unnecessary latency.

  - No exception forwarding (many exceptions are simply swallowed), making
debugging and operation in flaky environments very hard

  - Limited number of handler threads. The RPC can only handle a fix number
of concurrent requests, forcing you to maintain separate thread pools to
delegate actions to

  - No support for primitive data types (or boxed primitives) as arguments,
everything has to be a specially serializable type

  - Problematic threading model. The RPC continuously spawns and terminates

Benefits of switching to the Akka actor model:

  - Akka solves all of the above issues out of the box

  - The supervisor model allows you to do failure detection of actors. That
provides a unified way of detecting and handling failures (missing
heartbeats, failed calls, ...)

  - Akka has tools to make stateful actors persistent and restart them on
other machines in cases of failure. That would greatly help in implementing
"master fail-over", which will become important

  - You can define many "call targets" (actors). Tasks (on taskmanagers)
can directly call their ExecutionVertex on the JobManager, rather than
calling the JobManager, creating a Runnable that looks up the execution
vertex, and so on...

  - The actor model's approach to queue actions on an actor and run the one
after another makes the concurrency model of the state machine very simple
and robust

  - We "outsource" our own concerns about maintaining and improving that
part of the system


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message