mesos-user mailing list archives

From: James Peach <jor...@gmail.com>
Subject: Re: Adding a new agent terminates existing executors?
Date: Wed, 15 Nov 2017 16:37:15 GMT

> On Nov 15, 2017, at 8:24 AM, Dan Leary <dll@touchplan.io> wrote:
> 
> Yes, as I said at the outset, the agents are on the same host, with different IPs, hostnames, and work_dirs.
> If having separate work_dirs is not sufficient to keep containers separated by agent, what additionally is required?

You might also need to specify other separate agent directories, like --runtime_dir, --docker_volume_checkpoint_dir, etc. Check the output of mesos-agent --flags.
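
For illustration, a minimal sketch of what that could look like when running two agents on one host. The paths and port numbers below are assumptions; only --work_dir, --runtime_dir, and --docker_volume_checkpoint_dir have come up in this thread, so check mesos-agent --flags for anything else your build keeps on disk:

```
# Hypothetical layout: give each agent its own copy of every flag that
# points at on-disk state, so one agent's recovery cannot find (and
# destroy) the other agent's containers.
mesos-agent --master=127.0.0.1:5050 \
            --ip=127.1.1.1 --hostname=agent1 --port=5051 \
            --work_dir=/var/lib/mesos/agent1 \
            --runtime_dir=/var/run/mesos/agent1 \
            --docker_volume_checkpoint_dir=/var/run/mesos/agent1/docker-volumes &

mesos-agent --master=127.0.0.1:5050 \
            --ip=127.1.1.2 --hostname=agent2 --port=5052 \
            --work_dir=/var/lib/mesos/agent2 \
            --runtime_dir=/var/run/mesos/agent2 \
            --docker_volume_checkpoint_dir=/var/run/mesos/agent2/docker-volumes &

# If the agents use the Mesos containerizer with cgroups isolation, a
# distinct --cgroups_root per agent is presumably needed as well, since
# the agent2 log below shows linux_launcher treating agent1's container
# as a "known orphaned container" during recovery.
```

The rule of thumb would be: any flag whose value points at shared on-disk (or cgroup) state needs a per-agent value.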

> 
> 
> On Wed, Nov 15, 2017 at 11:13 AM, Vinod Kone <vinodkone@apache.org> wrote:
> How is agent2 able to see agent1's containers? Are they running on the same box!? Are they somehow sharing the filesystem? If yes, that's not supported.
> 
> On Wed, Nov 15, 2017 at 8:07 AM, Dan Leary <dll@touchplan.io> wrote:
> Sure, master log and agent logs are attached.
> 
> Synopsis:  In the master log, tasks t000001 and t000002 are running...
> 
> > I1114 17:08:15.972033  5443 master.cpp:6841] Status update TASK_RUNNING (UUID: 9686a6b8-b04d-4bc5-9d26-32d50c7b0f74) for task t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:19.142276  5448 master.cpp:6841] Status update TASK_RUNNING (UUID: a6c72f31-2e47-4003-b707-9e8c4fb24f05) for task t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> 
> Operator starts up agent2 around 17:08:50ish.  Executor1 and its tasks are terminated....
> 
> > I1114 17:08:54.835841  5447 master.cpp:6964] Executor 'executor1' of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1): terminated with signal Killed
> > I1114 17:08:54.835959  5447 master.cpp:9051] Removing executor 'executor1' with resources [] of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.837419  5436 master.cpp:6841] Status update TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.837497  5436 master.cpp:6903] Forwarding status update TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.837896  5436 master.cpp:8928] Updating the state of task t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest state: TASK_FAILED, status update state: TASK_FAILED)
> > I1114 17:08:54.839159  5436 master.cpp:6841] Status update TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.839221  5436 master.cpp:6903] Forwarding status update TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.839493  5436 master.cpp:8928] Updating the state of task t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest state: TASK_FAILED, status update state: TASK_FAILED)
> 
> But agent2 doesn't register until later...
> 
> > I1114 17:08:55.588762  5442 master.cpp:5714] Received register agent message from slave(1)@127.1.1.2:5052 (agent2)
> 
> Meanwhile, in the agent1 log, the termination of executor1 appears to be the result of the destruction of its container...
> 
> > I1114 17:08:54.810638  5468 containerizer.cpp:2612] Container cbcf6992-3094-4d0f-8482-4d68f68eae84 has exited
> > I1114 17:08:54.810732  5468 containerizer.cpp:2166] Destroying container cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.810761  5468 containerizer.cpp:2712] Transitioning the state of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
> 
> Apparently because agent2 decided to "recover" the very same container...
> 
> > I1114 17:08:54.775907  6041 linux_launcher.cpp:373] cbcf6992-3094-4d0f-8482-4d68f68eae84 is a known orphaned container
> > I1114 17:08:54.779634  6037 containerizer.cpp:966] Cleaning up orphan container cbcf6992-3094-4d0f-8482-4d68f68eae84
> > I1114 17:08:54.779705  6037 containerizer.cpp:2166] Destroying container cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.779737  6037 containerizer.cpp:2712] Transitioning the state of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
> > I1114 17:08:54.780740  6041 linux_launcher.cpp:505] Asked to destroy container cbcf6992-3094-4d0f-8482-4d68f68eae84
> 
> Seems like an issue with the containerizer?
> 
> 
> On Tue, Nov 14, 2017 at 4:46 PM, Vinod Kone <vinodkone@apache.org> wrote:
> That seems weird then. A new agent coming up on a new IP and host shouldn't affect other agents running on different hosts. Can you share master logs that surface the issue?
> 
> On Tue, Nov 14, 2017 at 12:51 PM, Dan Leary <dll@touchplan.io> wrote:
> Just one mesos-master (no zookeeper) with --ip=127.0.0.1 --hostname=localhost.
> In /etc/hosts are 
>   127.1.1.1    agent1
>   127.1.1.2    agent2
> etc. and mesos-agent gets passed --ip=127.1.1.1 --hostname=agent1 etc.
> 
> 
> On Tue, Nov 14, 2017 at 3:41 PM, Vinod Kone <vinodkone@apache.org> wrote:
> ```Experiments thus far are with a cluster all on a single host, master on 127.0.0.1, agents have their own IPs and hostnames and ports.```
> 
> What does this mean? How are all your masters and agents on the same host but still get different IPs and hostnames?
> 
> 
> On Tue, Nov 14, 2017 at 12:22 PM, Dan Leary <dll@touchplan.io> wrote:
> So I have a bespoke framework that runs under 1.4.0 using the v1 HTTP API, with a custom executor and checkpointing disabled.
> When the framework is running happily and a new agent is added to the cluster, all the existing executors immediately get terminated.
> The scheduler is told of the lost executors and tasks and then receives offers about agents old and new and carries on normally.
> 
> I would expect, however, that the existing executors should keep running and the scheduler should just receive offers about the new agent.
> It's as if agent recovery is being performed when the new agent is launched even though no old agent has exited.
> Experiments thus far are with a cluster all on a single host, master on 127.0.0.1, agents have their own IPs and hostnames and ports.
> 
> Am I missing a configuration parameter? Or is this correct behavior?
> 
> -Dan

