mesos-user mailing list archives

From Joseph Wu <jos...@mesosphere.io>
Subject Re: framework failover
Date Fri, 04 Nov 2016 18:03:05 GMT
A couple questions/notes:

What do you mean by:

> the system will deploy the framework on a new node within less than three
> minutes.

Are you running your frameworks via Marathon?

How are you terminating the Mesos Agent?  If you send a `kill -SIGUSR1`,
the agent will immediately kill all of its tasks and unregister from the
master.
If you kill the agent with some other signal, the agent process will simply
stop, but its tasks will continue to run.

> According to the mesos GUI page cassandra holds 99-100 % of the resources
> on the terminated slave during that 14 minutes.

^ This implies that the master did not remove the agent immediately, meaning
you killed the agent, but did not kill its tasks.
During this time, the master is waiting for the agent to come back online.
If the agent doesn't come back within some (configurable) timeout, the master
will notify the frameworks about the loss of the agent.
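
To make the "(configurable) timeout" concrete, here is a rough sketch of the arithmetic. It assumes the timeout in question is the master's agent health check, which in Mesos 0.28 is, as far as I recall, governed by the master flags `--slave_ping_timeout` (default 15secs) and `--max_slave_ping_timeouts` (default 5). Please check the flag names and defaults against your version; the helper name `agent_removal_delay` is made up for illustration.

```python
from datetime import timedelta


def agent_removal_delay(ping_timeout: timedelta, max_ping_timeouts: int) -> timedelta:
    """Approximate time the master waits before giving up on a silent agent.

    Assumption: the master removes the agent after `max_ping_timeouts`
    consecutive pings have each gone unanswered for `ping_timeout`.
    """
    return ping_timeout * max_ping_timeouts


# Assumed defaults: --slave_ping_timeout=15secs, --max_slave_ping_timeouts=5
default_delay = agent_removal_delay(timedelta(seconds=15), 5)
print(default_delay.total_seconds())  # 75.0
```

If those defaults apply, the health check alone would account for a bit over a minute, so a 14 minute delay probably has another cause.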

Also, it's a little odd that your frameworks will disconnect upon the agent
process dying.  You may want to investigate your framework dependencies.  A
framework should definitely not depend on the agent process (frameworks
depend on the master though).



On Fri, Nov 4, 2016 at 10:32 AM, Jaana Miettinen <jaanam@kolumbus.fi> wrote:

> Hi, could you help me find out how framework failover happens in
> Mesos 0.28.0?
>
>
>
> In my mesos-environment I have the following  frameworks:
>
>
>
> etcd-mesos
>
> cassandra-mesos 0.2.0-1
>
> eremetic
>
> marathon 0.15.2
>
>
>
> If I shut down the agent (mesos-slave) on which my framework has been
> deployed from the Linux command line with the ‘halt’ command, the system will
> deploy the framework on a new node within less than three minutes.
>
>
>
> But when I shut down the agent on which the cassandra framework is running,
> it takes 14 minutes before the system recovers.
>
>
>
> According to the mesos GUI page cassandra holds 99-100 % of the resources
> on the terminated slave during that 14 minutes.
>
>
>
> Seen from the mesos-log:
>
>
>
> Line 976: I1104 08:53:29.516564 15502 master.cpp:1173] Slave c002796f-a98d-4e55-bee3-f51b8d06323b-S8 at slave(1)@10.254.69.140:5050 (mesos-slave-1) disconnected
>
> Line 977: I1104 08:53:29.516644 15502 master.cpp:2586] Disconnecting slave c002796f-a98d-4e55-bee3-f51b8d06323b-S8 at slave(1)@10.254.69.140:5050 (mesos-slave-1)
>
> Line 1020: I1104 08:53:39.872681 15501 master.cpp:1212] Framework c002796f-a98d-4e55-bee3-f51b8d06323b-0007 (Eremetic) at scheduler(1)@10.254.69.140:31570 disconnected
>
> Line 1021: I1104 08:53:39.872707 15501 master.cpp:2527] Disconnecting framework c002796f-a98d-4e55-bee3-f51b8d06323b-0007 (Eremetic) at scheduler(1)@10.254.69.140:31570
>
> Line 1080: W1104 08:54:53.621151 15503 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b8d06323b-0007 (Eremetic) at scheduler(1)@10.254.69.140:31570
>
> Line 1083: W1104 08:54:53.621279 15503 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b8d06323b-0004 (Eremetic) at scheduler(1)@10.254.74.77:31956
>
> Line 1085: W1104 08:54:53.621354 15503 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b8d06323b-0002 (Eremetic) at scheduler(1)@10.254.77.2:31460
>
> Line 1219: I1104 09:09:09.933365 15502 master.cpp:1212] Framework c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495 disconnected
>
> Line 1220: I1104 09:09:09.933404 15502 master.cpp:2527] Disconnecting framework c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495
>
> Line 1222: W1104 09:09:09.933518 15502 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495
>
> Line 1223: W1104 09:09:09.933697 15502 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495
>
> Line 1224: W1104 09:09:09.933768 15502 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495
>
> Line 1225: W1104 09:09:09.933825 15502 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495
>
>
>
> E1104 08:53:38.611367 15505 process.cpp:1958] Failed to shutdown socket
> with fd 38: Transport endpoint is not connected
>
> E1104 08:54:56.627190 15505 process.cpp:1958] Failed to shutdown socket
> with fd 44: Transport endpoint is not connected
>
> E1104 08:54:56.627286 15505 process.cpp:1958] Failed to shutdown socket
> with fd 38: Transport endpoint is not connected
>
> E1104 08:56:00.941144 15505 process.cpp:1958] Failed to shutdown socket
> with fd 29: Transport endpoint is not connected
>
> E1104 08:57:00.845110 15505 process.cpp:1958] Failed to shutdown socket
> with fd 32: Transport endpoint is not connected
>
> E1104 09:09:09.933151 15505 process.cpp:1958] Failed to shutdown socket
> with fd 35: Transport endpoint is not connected
>
> E1104 09:09:12.939226 15505 process.cpp:1958] Failed to shutdown socket
> with fd 32: Transport endpoint is not connected
>
>
>
> So which message did mesos try to send to Cassandra at 09:09:09.933518?
>
>
>
> And if mesos knew that the cassandra framework was running on the failed
> node, why didn’t it disconnect it the same way as Eremetic was disconnected?
>
>
>
> I’ve also noticed that the recovery (= resource deallocation) starts after
> cassandra’s disconnection, and no resources are offered by mesos before
> that. That’s why I’m currently most interested in understanding which event
> triggers the Cassandra disconnect at 09:09:09.933404.
>
>
>
> Please ask if you need more information.
>
>
>
> Thanks in advance,
>
>
>
> Jaana Miettinen
>
>
>
>
>
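
For reference, the delay in the quoted master log can be measured directly from the glog-style timestamps. The sketch below is only an illustration: it assumes the usual glog prefix layout (severity letter, MMDD, then the time with microseconds) and takes the year 2016 from the date of this thread, since the log lines themselves carry no year. The helper name `parse_glog_time` is hypothetical.

```python
from datetime import datetime


def parse_glog_time(prefix: str, year: int = 2016) -> datetime:
    """Parse a glog prefix such as 'I1104 08:53:29.516564'.

    The first character is the severity (I/W/E); the rest is MMDD and the
    time of day. The year is not in the log, so it is passed in (assumed).
    """
    stamp = prefix[1:]  # drop the severity letter
    return datetime.strptime(f"{year}{stamp}", "%Y%m%d %H:%M:%S.%f")


# Timestamps copied from the quoted log (Line 976 and Line 1220).
agent_disconnected = parse_glog_time("I1104 08:53:29.516564")
cassandra_disconnected = parse_glog_time("I1104 09:09:09.933404")

gap = cassandra_disconnected - agent_disconnected
print(gap)  # 0:15:40.416840
```

So, going by these two log lines, the cassandra framework was disconnected roughly 15 minutes 40 seconds after the agent, while Eremetic was disconnected about 10 seconds after it.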
