Date: Tue, 10 Mar 2015 16:26:38 +0000 (UTC)
From: "Joerg Schad (JIRA)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Commented] (MESOS-2419) Slave recovery not recovering tasks

    [ https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355150#comment-14355150 ]

Joerg Schad commented on MESOS-2419:
------------------------------------

Prepared more readable logs here: https://docs.google.com/document/d/1TxDG9UwYGSVUjmPFBO9b5304HecS5I0SvB8OfUvi8WY/edit#

> Slave recovery not recovering tasks
> -----------------------------------
>
>                 Key: MESOS-2419
>                 URL: https://issues.apache.org/jira/browse/MESOS-2419
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.22.0, 0.23.0
>            Reporter: Brenden Matthews
>            Assignee: Joerg Schad
>         Attachments: mesos-chronos.log.gz, mesos.log.gz
>
>
> In a recent build from master (updated yesterday), slave recovery appears to have broken.
> I'll attach the slave log (with GLOG_v=1) showing a task called `long-running-job`, which is a Chronos job that just does `sleep 1h`. After restarting the slave, the task shows as `TASK_FAILED`.
> Here's another case, which is for a docker task:
> {noformat}
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247207 10022 docker.cpp:468] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254791 10022 docker.cpp:1333] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254812 10022 docker.cpp:1159] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container f2001064-e076-4978-b764-ed12a5244e78
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-0000 at executor(1)@10.81.189.232:43130
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework '20150226-230228-2931198986-5050-717-0000' failed: Container 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-0000: Not monitored
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-0000 from @0.0.0.0:0
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.599093 10023 slave.cpp:2637] Failed to update resources for container f2001064-e076-4978-b764-ed12a5244e78 of executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 running task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 on status update for terminal task, destroying container: Container 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:50.599148 10024 composing.cpp:513] Container 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.599220 10024 status_update_manager.cpp:317] Received status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.599256 10024 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:50.607086 10022 slave.cpp:2706] Dropping status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-0000 sent by status update manager because the slave is in RECOVERING state
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:52.594267 10021 slave.cpp:2457] Cleaning up un-reregistered executors
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:52.594379 10021 slave.cpp:3794] Finished recovery
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
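
For reference, here is a rough sketch of the reproduction described in the issue above. It is only a sketch: the Chronos job definition, the Chronos host/port, the service name, and the work directory path are illustrative assumptions, not values taken from the cluster in the quoted logs.

{noformat}
# Illustrative reproduction sketch (host, port, service name, and paths are assumptions).

# 1. Schedule a long-running Chronos job that just sleeps, via the Chronos REST API.
curl -X POST -H 'Content-Type: application/json' \
  http://chronos.example.com:4400/scheduler/iso8601 \
  -d '{"name": "long-running-job", "command": "sleep 1h",
       "schedule": "R/2015-03-10T16:00:00Z/PT1H", "owner": "test@example.com"}'

# 2. Once the task is RUNNING, restart the slave so it has to go through recovery.
sudo systemctl restart mesos-slave    # or: sudo service mesos-slave restart

# 3. Follow the slave log for the recovery sequence quoted above.
journalctl -u mesos-slave -f | \
  grep -E 'Recovering container|has exited|Destroying container|TASK_FAILED|Finished recovery'

# 4. Inspect what was checkpointed for the run (assumes work_dir=/var/lib/mesos).
ls -R /var/lib/mesos/meta/slaves/latest/frameworks
{noformat}

If recovery behaves as expected, the task should still be RUNNING after "Finished recovery" rather than flipping to TASK_FAILED as in the logs above.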