flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Till Rohrmann <trohrm...@apache.org>
Subject Re: Questions concerning Akka / Documentation
Date Mon, 05 Jan 2015 18:14:48 GMT
I addressed the issues.

On Mon, Jan 5, 2015 at 3:16 PM, Stephan Ewen <sewen@apache.org> wrote:

> Hi!
>
> Thanks for clarifying. Here are some thoughts:
>
> 1) The akka URL should go through a non-exposes mechanism, true. In fact,
> using the global configuration for the local embedded mode at all seems to
> be a bad design that we should get rid of.


Sooner or later we should also get rid of the global configuration.


>
> 2) Okay, so we keep our own hearbeats in place as a means for metric
> reports. At some point, we can avoid having the JobManager actor watch the
> TaskManager actor then, it seems.
>

That is right. We can choose depending which one performs better.


>
> 3) re: transport failure detector - makes sense
>
> 4) Yes, let's have a single timeout value that defines the ask timeout, tcp
> timeout, and the interval of the watch failure detector, and allow to
> override them by specifying the options.
>

I chose the following heuristics for the moment:

ask timeout = tcp timeout = startup timeout = death watch pause = 10 *
interval of death watch

We can see how they behave and if necessary adapt.


>
> Stephan
>
>
> On Mon, Jan 5, 2015 at 10:30 AM, Till Rohrmann <trohrmann@apache.org>
> wrote:
>
> > Hi,
> >
> > you are right, the new implementation still lacks a lot of documentation
> > which makes understanding the code harder than necessary.
> >
> > On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <sewen@apache.org> wrote:
> >
> > > Hi!
> > >
> > > Since the new distributed infrastructure is built on Akka, some
> internal
> > > concepts have changed now.
> > > I think that this is currently not really document anywhere
> > >
> > > @Till Can you elaborate on the questions here:
> > >
> > >  - What is the Akka URL in the global configuration
> > ("jobmanager.akka.url")
> > > From the perspective of the global configuration, don't we simply have
> > the
> > > address and port of the actor system?
> > >
> >
> > The jobmanager.akka.url is used to overwrite the default akka url
> > generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary in
> > cases where we do not have remote actor systems but a single local, as in
> > the case of local execution, and thus have to use a different url scheme.
> > In case of a single actor system, the url would be
> > akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only
> > used internally and should not be configured by the user. To make it
> > fail-safe we should probably use a non exposed mechanism.
> >
> >
> > >
> > >  - We currently have multiple competing failure-detection mechanisms:
> For
> > > one, the job manager actor watches the task manager actors. Also, we
> > still
> > > have the manual heart beats in place. Shouldn't we remove the old
> manual
> > > heartbeats and have the instance manager watch the task manager actors?
> > >
> >
> > It's right that we still have the old heartbeats in place but they are
> > stripped down. Currently, they are only used to update the
> > lastReceivedHeartBeat field in the Instance object. Consequently, they
> > could be simply removed at the price of not getting shown the time since
> > the last heartbeat in the web interface. The failure detection mechanism
> is
> > currently realized exclusively by using Akka's death watch, meaning that
> > the JobManager watches the TaskManagers and vice versa. I also thought
> that
> > some people wanted to piggy back on the heartbeat message to do
> monitoring.
> > Therefore I kept it for the moment. But I guess that a dedicated
> monitoring
> > message would be better.
> >
> >
> > >  - There are transport heartbeats and watch heartbeats. I could not
> find
> > a
> > > good explanation of what the transport heartbeats are. Also, the
> > heartbeat
> > > interval is very large (1000 s) by default, so I am wondering what
> there
> > > purpose is.
> > >
> >
> > Yes you're right that Akka has a lot of little knobs to turn and twist
> and
> > some of them are more obvious than others. The transport failure detector
> > is Akka's own mechanism to detect lost messages. This is necessary for
> UDP
> > but not for TCP since it has its own failure detector. In order to
> decrease
> > the unnecessary network traffic, I set the heartbeat pause and heartbeat
> > interval of the transport failure detector to these high numbers.
> >
> >
> > >
> > >  - There are many different timeouts:
> > >    -> startup timeout
> > >
> >
> > That is the timeout for creating an actor system.
> >
> >
> > >    -> watch heartbeat timeout
> > >
> >
> > This timeout is used for the death watch. But the detector is actually
> > controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause
> and
> > akka.watch.threshold. In [1] it is described what these parameters do.
> >
> >
> > >    -> ask timeout
> > >
> >
> > That is the general timeout which is used for all futures once the actor
> > system has been started.
> >
> >
> > >    -> TCP timeout
> > >
> >
> > The TCP timeout is the timeout which is used by Netty for all outbound
> > connections.
> >
> >
> > >   How to the relate / interact? Does it make sense to define them
> > relative
> > > to one another?
> >
> >
> > For the sake of simplicity and usability, it is a good idea to derive the
> > individual timeouts by means of some heuristics from a single timeout
> > value. Maybe we could use these heuristics as default values but still
> > allow the user to define these values himself if he wants to.
> >
> >
> > >
> > > I think it makes a lot of sense to document these points somewhere.
> > >
> >
> > I'll add an overview and details of the implementation to the internals
> > section of the documentation.
> >
> >
> > >
> > > Greetings,
> > > Stephan
> > >
> >
> > [1]
> >
> >
> http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message