flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephan Ewen <se...@apache.org>
Subject Re: Questions concerning Akka / Documentation
Date Mon, 05 Jan 2015 18:47:30 GMT
Sounds good!

On Mon, Jan 5, 2015 at 7:14 PM, Till Rohrmann <trohrmann@apache.org> wrote:

> I addressed the issues.
>
> On Mon, Jan 5, 2015 at 3:16 PM, Stephan Ewen <sewen@apache.org> wrote:
>
> > Hi!
> >
> > Thanks for clarifying. Here are some thoughts:
> >
> > 1) The akka URL should go through a non-exposes mechanism, true. In fact,
> > using the global configuration for the local embedded mode at all seems
> to
> > be a bad design that we should get rid of.
>
>
> Sooner or later we should also get rid of the global configuration.
>
>
> >
> > 2) Okay, so we keep our own hearbeats in place as a means for metric
> > reports. At some point, we can avoid having the JobManager actor watch
> the
> > TaskManager actor then, it seems.
> >
>
> That is right. We can choose depending which one performs better.
>
>
> >
> > 3) re: transport failure detector - makes sense
> >
> > 4) Yes, let's have a single timeout value that defines the ask timeout,
> tcp
> > timeout, and the interval of the watch failure detector, and allow to
> > override them by specifying the options.
> >
>
> I chose the following heuristics for the moment:
>
> ask timeout = tcp timeout = startup timeout = death watch pause = 10 *
> interval of death watch
>
> We can see how they behave and if necessary adapt.
>
>
> >
> > Stephan
> >
> >
> > On Mon, Jan 5, 2015 at 10:30 AM, Till Rohrmann <trohrmann@apache.org>
> > wrote:
> >
> > > Hi,
> > >
> > > you are right, the new implementation still lacks a lot of
> documentation
> > > which makes understanding the code harder than necessary.
> > >
> > > On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <sewen@apache.org>
> wrote:
> > >
> > > > Hi!
> > > >
> > > > Since the new distributed infrastructure is built on Akka, some
> > internal
> > > > concepts have changed now.
> > > > I think that this is currently not really document anywhere
> > > >
> > > > @Till Can you elaborate on the questions here:
> > > >
> > > >  - What is the Akka URL in the global configuration
> > > ("jobmanager.akka.url")
> > > > From the perspective of the global configuration, don't we simply
> have
> > > the
> > > > address and port of the actor system?
> > > >
> > >
> > > The jobmanager.akka.url is used to overwrite the default akka url
> > > generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary
> in
> > > cases where we do not have remote actor systems but a single local, as
> in
> > > the case of local execution, and thus have to use a different url
> scheme.
> > > In case of a single actor system, the url would be
> > > akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only
> > > used internally and should not be configured by the user. To make it
> > > fail-safe we should probably use a non exposed mechanism.
> > >
> > >
> > > >
> > > >  - We currently have multiple competing failure-detection mechanisms:
> > For
> > > > one, the job manager actor watches the task manager actors. Also, we
> > > still
> > > > have the manual heart beats in place. Shouldn't we remove the old
> > manual
> > > > heartbeats and have the instance manager watch the task manager
> actors?
> > > >
> > >
> > > It's right that we still have the old heartbeats in place but they are
> > > stripped down. Currently, they are only used to update the
> > > lastReceivedHeartBeat field in the Instance object. Consequently, they
> > > could be simply removed at the price of not getting shown the time
> since
> > > the last heartbeat in the web interface. The failure detection
> mechanism
> > is
> > > currently realized exclusively by using Akka's death watch, meaning
> that
> > > the JobManager watches the TaskManagers and vice versa. I also thought
> > that
> > > some people wanted to piggy back on the heartbeat message to do
> > monitoring.
> > > Therefore I kept it for the moment. But I guess that a dedicated
> > monitoring
> > > message would be better.
> > >
> > >
> > > >  - There are transport heartbeats and watch heartbeats. I could not
> > find
> > > a
> > > > good explanation of what the transport heartbeats are. Also, the
> > > heartbeat
> > > > interval is very large (1000 s) by default, so I am wondering what
> > there
> > > > purpose is.
> > > >
> > >
> > > Yes you're right that Akka has a lot of little knobs to turn and twist
> > and
> > > some of them are more obvious than others. The transport failure
> detector
> > > is Akka's own mechanism to detect lost messages. This is necessary for
> > UDP
> > > but not for TCP since it has its own failure detector. In order to
> > decrease
> > > the unnecessary network traffic, I set the heartbeat pause and
> heartbeat
> > > interval of the transport failure detector to these high numbers.
> > >
> > >
> > > >
> > > >  - There are many different timeouts:
> > > >    -> startup timeout
> > > >
> > >
> > > That is the timeout for creating an actor system.
> > >
> > >
> > > >    -> watch heartbeat timeout
> > > >
> > >
> > > This timeout is used for the death watch. But the detector is actually
> > > controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause
> > and
> > > akka.watch.threshold. In [1] it is described what these parameters do.
> > >
> > >
> > > >    -> ask timeout
> > > >
> > >
> > > That is the general timeout which is used for all futures once the
> actor
> > > system has been started.
> > >
> > >
> > > >    -> TCP timeout
> > > >
> > >
> > > The TCP timeout is the timeout which is used by Netty for all outbound
> > > connections.
> > >
> > >
> > > >   How to the relate / interact? Does it make sense to define them
> > > relative
> > > > to one another?
> > >
> > >
> > > For the sake of simplicity and usability, it is a good idea to derive
> the
> > > individual timeouts by means of some heuristics from a single timeout
> > > value. Maybe we could use these heuristics as default values but still
> > > allow the user to define these values himself if he wants to.
> > >
> > >
> > > >
> > > > I think it makes a lot of sense to document these points somewhere.
> > > >
> > >
> > > I'll add an overview and details of the implementation to the internals
> > > section of the documentation.
> > >
> > >
> > > >
> > > > Greetings,
> > > > Stephan
> > > >
> > >
> > > [1]
> > >
> > >
> >
> http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message