mesos-issues mailing list archives

From: "Adam B (JIRA)" <j...@apache.org>
Subject: [jira] [Updated] (MESOS-2605) The slave sometimes does not send active executors during reregistration
Date: Fri, 10 Apr 2015 00:11:15 GMT

     [ https://issues.apache.org/jira/browse/MESOS-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam B updated MESOS-2605:
--------------------------
    Description: 
Framework checkpointing is enabled, and the executor successfully reregisters. However, the
tasks in that executor (here, task.journalnode.journalnode.NodeExecutor.1428609184051) are
marked LOST (abnormal executor termination) because the Mesos master removes the executor as
unknown to the slave.

Slave logs for the task:

{code}
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.778790 25126
status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.779013 25126
status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_RUNNING (UUID:
4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051
of framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781788 25123
slave.cpp:2753] Forwarding the update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
to master@10.142.250.253:5050
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781889 25123
slave.cpp:2686] Sending acknowledgement for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
to executor(1)@10.168.119.78:47638
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784503 25124
status_update_manager.cpp:389] Received status update acknowledgement (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784567 25124
status_update_manager.hpp:346] Checkpointing ACK for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
{code}

Master Logs:

{code}
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: W0409 20:19:43.008666 
1067 master.cpp:4015] Executor executor.journalnode.NodeExecutor.1428609184051 of framework
20150408-002100-4261056010-5050-1047-0008 possibly unknown to the slave 20150407-233647-2059219722-5050-1659-S5
at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008652 
1074 hierarchical.hpp:648] Recovered cpus(*):0.1; mem(*):1536 (total allocatable: cpus(*):3.5;
mem(*):21113; disk(*):142210; ports(*):[3889-5044, 5046-5049, 2182-2958, 2960-3887, 1025-2180,
8082-9041, 9043-9159, 9161-9999, 5052-6999, 7002-7198, 7200-8079, 10001-65535]) on slave 20150407-233647-2059219722-5050-1659-S5
from framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008712 
1067 master.cpp:4714] Removing executor 'executor.journalnode.NodeExecutor.1428609184051'
with resources cpus(*):0.1; mem(*):1536 of framework 20150408-002100-4261056010-5050-1047-0008
on slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)


Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.010372 
1067 master.cpp:3295] Status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
from slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)

Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: E0409 20:19:43.013700 
1067 master.cpp:4568] Ignoring out of order status update for task ct:1428610776000:0:mesos-slave-restart:
(TASK_FAILED -> TASK_RUNNING) of framework 20150407-233647-2059219722-5050-1659-0001
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.013746 
1067 master.cpp:3295] Status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
from slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.013767 
1067 master.cpp:3336] Forwarding status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
{code}
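
As a reading aid, the master-side behavior visible in the logs above can be sketched roughly as follows. This is a simplified illustration in Python, not Mesos source code: the function name `reconcile_executors` and its data shapes are hypothetical, and it assumes, per the description, that an executor missing from the slave's reregistration message is removed and its tasks are reported as TASK_LOST.

```python
def reconcile_executors(master_known, slave_reported):
    """Return the executors a master would treat as unknown to the slave.

    master_known:   dict mapping executor ID -> list of task IDs the master
                    believes are running in that executor (hypothetical shape).
    slave_reported: iterable of executor IDs the slave included in its
                    reregistration message (hypothetical shape).
    """
    reported = set(slave_reported)
    removed = {}
    for executor_id, task_ids in master_known.items():
        if executor_id not in reported:
            # Assumed behavior, following the report: the executor is
            # "possibly unknown to the slave", so it is removed and each of
            # its tasks ends up as TASK_LOST.
            removed[executor_id] = [(task_id, "TASK_LOST") for task_id in task_ids]
    return removed


if __name__ == "__main__":
    known = {
        "executor.journalnode.NodeExecutor.1428609184051": [
            "task.journalnode.journalnode.NodeExecutor.1428609184051"
        ]
    }
    # The slave reregisters without listing its active executor.
    print(reconcile_executors(known, []))
```

With the IDs from this report, an empty reported list yields the executor's task as TASK_LOST, while a reregistration that includes the executor ID yields no removals.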

  was:
Framework checkpointing is enabled, and the executor successfully reregisters. However, the
tasks in that executor (here, task.journalnode.journalnode.NodeExecutor.1428609184051) are
marked LOST (abnormal executor termination) because the Mesos master removes the executor as
unknown to the slave.

Slave logs for the task:

{code}
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.778790 25126
status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.779013 25126
status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_RUNNING (UUID:
4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051
of framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781788 25123
slave.cpp:2753] Forwarding the update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
to master@10.142.250.253:5050
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781889 25123
slave.cpp:2686] Sending acknowledgement for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
to executor(1)@10.168.119.78:47638
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784503 25124
status_update_manager.cpp:389] Received status update acknowledgement (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784567 25124
status_update_manager.hpp:346] Checkpointing ACK for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008

Master Logs:

Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: W0409 20:19:43.008666 
1067 master.cpp:4015] Executor executor.journalnode.NodeExecutor.1428609184051 of framework
20150408-002100-4261056010-5050-1047-0008 possibly unknown to the slave 20150407-233647-2059219722-5050-1659-S5
at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008652 
1074 hierarchical.hpp:648] Recovered cpus(*):0.1; mem(*):1536 (total allocatable: cpus(*):3.5;
mem(*):21113; disk(*):142210; ports(*):[3889-5044, 5046-5049, 2182-2958, 2960-3887, 1025-2180,
8082-9041, 9043-9159, 9161-9999, 5052-6999, 7002-7198, 7200-8079, 10001-65535]) on slave 20150407-233647-2059219722-5050-1659-S5
from framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008712 
1067 master.cpp:4714] Removing executor 'executor.journalnode.NodeExecutor.1428609184051'
with resources cpus(*):0.1; mem(*):1536 of framework 20150408-002100-4261056010-5050-1047-0008
on slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)


Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.010372 
1067 master.cpp:3295] Status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
from slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)

Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: E0409 20:19:43.013700 
1067 master.cpp:4568] Ignoring out of order status update for task ct:1428610776000:0:mesos-slave-restart:
(TASK_FAILED -> TASK_RUNNING) of framework 20150407-233647-2059219722-5050-1659-0001
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.013746 
1067 master.cpp:3295] Status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
from slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.013767 
1067 master.cpp:3336] Forwarding status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640)
for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
{code}


> The slave sometimes does not send active executors during reregistration
> ------------------------------------------------------------------------
>
>                 Key: MESOS-2605
>                 URL: https://issues.apache.org/jira/browse/MESOS-2605
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.22.0
>            Reporter: Elizabeth Lingg
>              Labels: mesosphere
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
