flink-dev mailing list archives

From Maximilian Michels <...@apache.org>
Subject Re: Flink-1.0.0 JobManager is not running in Docker Container on AWS
Date Mon, 14 Mar 2016 08:37:25 GMT
Hi Deepak,

We'll look into this problem more this week. Until now we considered it a
configuration issue if the bind address was not externally reachable.
However, one is not always able to change the network configuration.

Looking further, it is actually possible to let the bind address be
different from the advertised address. From the Akka FAQ at
http://doc.akka.io/docs/akka/2.4.1/additional/faq.html:

> If you are running an ActorSystem under a NAT or inside a docker container,
> make sure to set akka.remote.netty.tcp.hostname and
> akka.remote.netty.tcp.port to the address it is reachable at from other
> ActorSystems. If you need to bind your network interface to a different
> address - use akka.remote.netty.tcp.bind-hostname and
> akka.remote.netty.tcp.bind-port settings. Also make sure your network is
> configured to translate from the address your ActorSystem is reachable at
> to the address your ActorSystem network interface is bound to.
>

It looks like we have to expose this configuration to users who have a
special network setup.
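
As a rough sketch of what such a setup would have to hand to Akka, here are
the four settings from the FAQ split into an advertised and a bind address.
Only the setting names come from the FAQ; the helper functions and all
addresses and ports are made-up placeholders, and Flink does not read any of
this today.

```python
# Sketch only: the four setting names come from the Akka FAQ quoted above;
# every address and port value below is a made-up placeholder.

def akka_remote_settings(advertised_host, advertised_port, bind_host, bind_port):
    """Return Akka remoting settings that split the advertised and bind address."""
    return {
        # Address other ActorSystems use to reach this one (e.g. the EC2 host).
        "akka.remote.netty.tcp.hostname": advertised_host,
        "akka.remote.netty.tcp.port": advertised_port,
        # Address the network interface inside the container is bound to.
        "akka.remote.netty.tcp.bind-hostname": bind_host,
        "akka.remote.netty.tcp.bind-port": bind_port,
    }


def render(settings):
    """Render the settings as 'key = value' lines, quoting string values."""
    lines = []
    for key, value in settings.items():
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"{key} = {rendered}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical values: an externally reachable host in front of a
    # container whose interface is bound to a Docker bridge address.
    print(render(akka_remote_settings("203.0.113.10", 6123, "172.17.0.2", 6123)))
```

As the FAQ notes, the network still has to translate from the advertised
address to the bind address (for Docker, a published port), otherwise the
settings alone will not help.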

Best,
Max

On Mon, Mar 14, 2016 at 5:42 AM, Deepak Jha <dkjhanitt@gmail.com> wrote:

> Hi Stephan & Ufuk,
> Thanks for your response.
>
> Yes, there is a way to run Docker (--net=host mode) in which the host
> machine's network stack is shared with the Docker container.
> Unfortunately, it's not supported by AWS ECS.
>
> I do have one more question for you. Could you please explain what
> happens when TaskManagers register themselves with the JobManager in HA
> mode? Does each TaskManager connect to the JobManager on a separate port?
> The reason I'm asking is that if I run 2 TaskManagers (in separate Docker
> containers), they are able to attach themselves to the JobManager (another
> Docker container; Flink HA setup using a remote ZooKeeper cluster), but
> soon after that they get disconnected. The logs are not very helpful
> either. I suspect that each TaskManager connects on a new port, and since
> Docker does not expose all ports by default, this may be the cause. I do
> not see this happen when I do not use Docker containers.
>
> Here is the log that I saw in the JobManager:
>
> 2016-03-12 08:55:55,010 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20] o.a.f.r.instance.InstanceManager -
> Registered TaskManager at 5673db03e679 (akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager) as
> 7eafcfddd6bd084f2ec5a32594603f4f. Current number of registered hosts
> is 1. *Current
> number of alive task slots is 1.*
> 2016-03-12 08:57:42,676 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20] o.a.f.r.instance.InstanceManager -
> Registered TaskManager at 7200a7da4da7 (akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager) as
> 320338e15a7a44ee64dc03a40f04fcd7. Current number of registered hosts
> is 2. *Current
> number of alive task slots is 2.*
> 2016-03-12 08:57:48,422 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20]
> o.a.f.runtime.jobmanager.JobManager - Task manager akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager terminated.
> 2016-03-12 08:57:48,422 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20] o.a.f.r.instance.InstanceManager -
> *Unregistered task manager akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager. Number of registered task
> managers 1. Number of available slots 1.*
> 2016-03-12 08:58:01,417 PST [WARN]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20]
> a.remote.ReliableDeliverySupervisor - Association with remote system
> [akka.tcp://flink@172.17.0.3:6121] has failed, address is now gated for
> [5000] ms. Reason is: [Disassociated].
> 2016-03-12 08:58:01,451 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20]
> o.a.f.runtime.jobmanager.JobManager - Task manager akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager wants to disconnect, because
> TaskManager akka://flink/user/taskmanager is disassociating.
> 2016-03-12 08:58:01,451 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20]
> o.a.f.r.instance.InstanceManager - *Unregistered
> task manager akka.tcp://flink@172.17.0.3:6121/user/taskmanager.
> Number of registered task managers 0. Number of available slots 0.*
> 2016-03-12 08:58:01,465 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20]
> o.a.f.r.instance.InstanceManager - *Registered
> TaskManager at 7200a7da4da7
> (akka.tcp://flink@172.17.0.3:6121/user/taskmanager) as
> b5dbbc829854afa3ec5d8f0b6f9dbd03. Current number of registered hosts is 1.
> Current number of alive task slots is 1.*
> 2016-03-12 08:58:03,383 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20]
> o.a.f.runtime.jobmanager.JobManager - Task manager akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager terminated.
> 2016-03-12 08:58:03,384 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20] o.a.f.r.instance.InstanceManager -
> *Unregistered task manager akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager. Number of registered task
> managers 0. Number of available slots 0.*
> 2016-03-12 08:58:04,988 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20] o.a.f.r.instance.InstanceManager -
> Registering TaskManager at akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager
> which was marked as dead earlier because of a heart-beat timeout.
> 2016-03-12 08:58:04,988 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20] o.a.f.r.instance.InstanceManager -
> Registered TaskManager at 7200a7da4da7 (akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager) as
> eac0ce12e6ec885863d3438d691f4ab2. Current number of registered hosts is 1.
> Current number of alive task slots is 1.
> 2016-03-12 08:58:21,382 PST [WARN]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20]
> a.remote.ReliableDeliverySupervisor - Association with remote system
> [akka.tcp://flink@172.17.0.3:6121] has failed, address is now gated for
> [5000] ms. Reason is: [Disassociated].
> 2016-03-12 08:58:21,388 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20]
> o.a.f.runtime.jobmanager.JobManager - Task manager akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager wants to disconnect, because
> TaskManager akka://flink/user/taskmanager is disassociating.
> 2016-03-12 08:58:21,388 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20] o.a.f.r.instance.InstanceManager -
> Unregistered task manager akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager.
> Number of registered task managers 0. Number of available slots 0.
> 2016-03-12 08:58:21,390 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20] o.a.f.r.instance.InstanceManager -
> Registered TaskManager at 7200a7da4da7 (akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager) as
> bda61dbd047d40889aa3868d5d4d86a9. Current number of registered hosts is 1.
> Current number of alive task slots is 1.
> 2016-03-12 08:58:25,433 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-18]
> o.a.f.runtime.jobmanager.JobManager - Task manager akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager terminated.
> 2016-03-12 08:58:25,434 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-18] o.a.f.r.instance.InstanceManager -
> Unregistered task manager akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager.
> Number of registered task managers 0. Number of available slots 0.
> 2016-03-12 08:58:28,947 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20] o.a.f.r.instance.InstanceManager -
> Registering TaskManager at akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager
> which was marked as dead earlier because of a heart-beat timeout.
> 2016-03-12 08:58:28,948 PST [INFO]  ec2-54-173-231-120.compute-1.a
> [flink-akka.actor.default-dispatcher-20] o.a.f.r.instance.InstanceManager -
> Registered TaskManager at 7200a7da4da7 (akka.tcp://
> flink@172.17.0.3:6121/user/taskmanager) as
> d42ea5c6e0053935a0973d8536f3d8a5. Current number of registered hosts is 1.
> Current number of alive task slots is 1.
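
One thing that may be worth checking in the log above is which hosts
registered under which Akka address. A minimal sketch, assuming log lines in
the format shown (the function name and the unwrapping step are my own, not
anything Flink ships):

```python
import re
from collections import defaultdict

# Matches "Registered TaskManager at <host> (<akka address>)" once the mail
# client's line wrapping has been undone.
REGISTERED = re.compile(r"Registered TaskManager at (\S+) \((akka\.tcp://[^)\s]+)\)")


def hosts_by_address(log_text):
    """Map each advertised Akka address to the set of hosts registered with it."""
    # Undo e-mail wrapping: drop quote markers and join the wrapped lines.
    unwrapped = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    # The wrapping also split "akka.tcp://" from the rest of the address.
    unwrapped = unwrapped.replace("akka.tcp:// ", "akka.tcp://")
    result = defaultdict(set)
    for host, address in REGISTERED.findall(unwrapped):
        result[address].add(host)
    return dict(result)
```

Run on the first two registrations in the log, this groups both 5673db03e679
and 7200a7da4da7 under akka.tcp://flink@172.17.0.3:6121/user/taskmanager,
i.e. two different containers advertise the same container-internal address.
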
>
>
> On Fri, Mar 11, 2016 at 5:23 AM, Stephan Ewen <sewen@apache.org> wrote:
>
> > Hi Deepak!
> >
> > We currently cannot split the bind address and the advertised address,
> > because the Akka library only accepts packets sent explicitly to the
> > bind address (not sure why Akka has this artificial limitation, but it
> > is there).
> >
> > Can you bridge the container IP address to be visible from the outside?
> >
> > Stephan
> >
> >
> > On Fri, Mar 11, 2016 at 1:03 PM, Ufuk Celebi <uce@apache.org> wrote:
> >
> > > Hey Deepak!
> > >
> > > Your description of Flink's behaviour is correct. To summarize:
> > >
> > > # Host Address
> > >
> > > If you specify a host address as an argument to the JVM (via
> > > jobmanager.sh or the start-cluster.sh scripts) then that one is used.
> > > If you don't, it falls back to the value configured in flink-conf.yaml
> > > (what you describe).
> > >
> > > # Ports
> > >
> > > By default, a random port is used and published via ZooKeeper. You
> > > can configure a port range only via recovery.jobmanager.port (what
> > > you describe).
> > >
> > > ---
> > >
> > > Your proposal would likely solve the issue, but isn't it possible to
> > > handle this outside of Flink? I've found this stack overflow question,
> > > which should be related:
> > >
> > >
> > > http://stackoverflow.com/questions/26539727/giving-a-docker-container-a-routable-ip-address
> > >
> > > What's your opinion?
> > >
> >
>
>
>
> --
> Thanks,
> Deepak Jha
>
