flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Till Rohrmann <trohrm...@apache.org>
Subject Re: Questions concerning Akka / Documentation
Date Mon, 05 Jan 2015 09:30:52 GMT
Hi,

you are right, the new implementation still lacks a lot of documentation
which makes understanding the code harder than necessary.

On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <sewen@apache.org> wrote:

> Hi!
>
> Since the new distributed infrastructure is built on Akka, some internal
> concepts have changed now.
> I think that this is currently not really document anywhere
>
> @Till Can you elaborate on the questions here:
>
>  - What is the Akka URL in the global configuration ("jobmanager.akka.url")
> From the perspective of the global configuration, don't we simply have the
> address and port of the actor system?
>

The jobmanager.akka.url is used to overwrite the default akka url
generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary in
cases where we do not have remote actor systems but a single local, as in
the case of local execution, and thus have to use a different url scheme.
In case of a single actor system, the url would be
akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only
used internally and should not be configured by the user. To make it
fail-safe we should probably use a non exposed mechanism.


>
>  - We currently have multiple competing failure-detection mechanisms: For
> one, the job manager actor watches the task manager actors. Also, we still
> have the manual heart beats in place. Shouldn't we remove the old manual
> heartbeats and have the instance manager watch the task manager actors?
>

It's right that we still have the old heartbeats in place but they are
stripped down. Currently, they are only used to update the
lastReceivedHeartBeat field in the Instance object. Consequently, they
could be simply removed at the price of not getting shown the time since
the last heartbeat in the web interface. The failure detection mechanism is
currently realized exclusively by using Akka's death watch, meaning that
the JobManager watches the TaskManagers and vice versa. I also thought that
some people wanted to piggy back on the heartbeat message to do monitoring.
Therefore I kept it for the moment. But I guess that a dedicated monitoring
message would be better.


>  - There are transport heartbeats and watch heartbeats. I could not find a
> good explanation of what the transport heartbeats are. Also, the heartbeat
> interval is very large (1000 s) by default, so I am wondering what there
> purpose is.
>

Yes you're right that Akka has a lot of little knobs to turn and twist and
some of them are more obvious than others. The transport failure detector
is Akka's own mechanism to detect lost messages. This is necessary for UDP
but not for TCP since it has its own failure detector. In order to decrease
the unnecessary network traffic, I set the heartbeat pause and heartbeat
interval of the transport failure detector to these high numbers.


>
>  - There are many different timeouts:
>    -> startup timeout
>

That is the timeout for creating an actor system.


>    -> watch heartbeat timeout
>

This timeout is used for the death watch. But the detector is actually
controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause and
akka.watch.threshold. In [1] it is described what these parameters do.


>    -> ask timeout
>

That is the general timeout which is used for all futures once the actor
system has been started.


>    -> TCP timeout
>

The TCP timeout is the timeout which is used by Netty for all outbound
connections.


>   How to the relate / interact? Does it make sense to define them relative
> to one another?


For the sake of simplicity and usability, it is a good idea to derive the
individual timeouts by means of some heuristics from a single timeout
value. Maybe we could use these heuristics as default values but still
allow the user to define these values himself if he wants to.


>
> I think it makes a lot of sense to document these points somewhere.
>

I'll add an overview and details of the implementation to the internals
section of the documentation.


>
> Greetings,
> Stephan
>

[1]
http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message