mesos-issues mailing list archives

From "Joerg Schad (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-2419) Slave recovery not recovering tasks
Date Tue, 17 Mar 2015 09:09:39 GMT

     [ https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joerg Schad updated MESOS-2419:
-------------------------------
    Description: 
{color:red}
Note: the resolution to this issue is described in the following comment:
https://issues.apache.org/jira/browse/MESOS-2419?focusedCommentId=14357028&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14357028
{color}


In a recent build from master (updated yesterday), slave recovery appears to be broken.

I'll attach the slave log (with GLOG_v=1) showing a task called `long-running-job`, which is a Chronos job that just runs `sleep 1h`. After restarting the slave, the task shows up as `TASK_FAILED`.
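For reference, here is a minimal sketch of how such a job can be submitted to Chronos over its REST API (assumptions: a stock Chronos install reachable at chronos.example.com:4400 via the /scheduler/iso8601 endpoint; the host, schedule, owner, and epsilon values below are placeholders, not taken from this cluster):

{noformat}
# Sketch only: submit a long-running Chronos job like the one in the attached log.
# chronos.example.com:4400 and the schedule/owner/epsilon values are placeholders.
import requests

job = {
    "name": "long-running-job",                 # task name referenced in the attached slave log
    "command": "sleep 1h",                      # long-running command that should survive a slave restart
    "schedule": "R/2015-02-27T00:00:00Z/PT2H",  # repeating ISO 8601 schedule (arbitrary start time)
    "owner": "ops@example.com",
    "epsilon": "PT30M",
}

resp = requests.post("http://chronos.example.com:4400/scheduler/iso8601", json=job)
resp.raise_for_status()  # raises if Chronos rejected the job
{noformat}

With checkpointing enabled on the framework and the slave, restarting mesos-slave while this task is running should let the executor re-register and the task keep running; instead the task comes back as `TASK_FAILED`.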

Here's another case, this time for a Docker task:

{noformat}
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247207 10022 docker.cpp:468] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717-0000
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254791 10022 docker.cpp:1333] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254812 10022 docker.cpp:1159] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78'
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 'f2001064-e076-4978-b764-ed12a5244e78'
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717-0000
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container f2001064-e076-4978-b764-ed12a5244e78
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78'
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-0000 at executor(1)@10.81.189.232:43130
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework '20150226-230228-2931198986-5050-717-0000' failed: Container 'f2001064-e076-4978-b764-ed12a5244e78' not found
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-0000: Not monitored
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-0000 from @0.0.0.0:0
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.599093 10023 slave.cpp:2637] Failed to update resources for container f2001064-e076-4978-b764-ed12a5244e78 of executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 running task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 on status update for terminal task, destroying container: Container 'f2001064-e076-4978-b764-ed12a5244e78' not found
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:50.599148 10024 composing.cpp:513] Container 'f2001064-e076-4978-b764-ed12a5244e78' not found
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.599220 10024 status_update_manager.cpp:317] Received status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-0000
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.599256 10024 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-0000
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:50.607086 10022 slave.cpp:2706] Dropping status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-0000 sent by status update manager because the slave is in RECOVERING state
Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:52.594267 10021 slave.cpp:2457] Cleaning up un-reregistered executors
Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:52.594379 10021 slave.cpp:3794] Finished recovery
{noformat}
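
A quick way to confirm what the scheduler side sees after the slave restart is to query the master's state endpoint and print the state of the affected task (a rough sketch; it assumes the master serves /master/state.json on port 5050, and mesos-master.example.com stands in for the real master host):

{noformat}
# Rough check of task state after restarting the slave.
# mesos-master.example.com:5050 is a placeholder for the real master address.
import requests

state = requests.get("http://mesos-master.example.com:5050/master/state.json").json()

for framework in state.get("frameworks", []):
    # After a clean recovery the task should still be listed as TASK_RUNNING;
    # in the failure described here it instead ends up in completed_tasks as TASK_FAILED.
    for task in framework.get("tasks", []) + framework.get("completed_tasks", []):
        if task.get("name") == "long-running-job" or task.get("id", "").startswith("chronos."):
            print(task["id"], task["state"])
{noformat}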



> Slave recovery not recovering tasks
> -----------------------------------
>
>                 Key: MESOS-2419
>                 URL: https://issues.apache.org/jira/browse/MESOS-2419
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>            Reporter: Brenden Matthews
>            Assignee: Joerg Schad
>         Attachments: mesos-chronos.log.gz, mesos.log.gz
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
