Date: Thu, 29 Sep 2016 17:05:20 +0000 (UTC)
From: "Huadong Liu (JIRA)"
To: issues@mesos.apache.org
Subject: [jira] [Comment Edited] (MESOS-4084) mesos-slave assigned marathon task wrongly to chronos framework after task failure

    [ https://issues.apache.org/jira/browse/MESOS-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533316#comment-15533316 ]

Huadong Liu edited comment on MESOS-4084 at 9/29/16 5:05 PM:
-------------------------------------------------------------

In addition to the marathon status update being wrongly sent to chronos, I wasn't able to find out why the marathon task failed in the first place. We saw another occurrence on 09/27, very much the same as what [~robjohnson] posted earlier.
The agent reported failure of the marathon task and termination of the executor:
{noformat}
I0927 13:07:19.613260 28958 slave.cpp:3211] Handling status update TASK_FAILED (UUID: 086a001e-387f-4e38-bcf9-b2c6595cd218) for task myservice.gitaefa9d79.config5e0a0677.e9416045-84ed-11e6-8d4d-02423c827edc of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 from executor(1)@10.46.1.35:40844
I0927 13:07:19.769088 28964 status_update_manager.cpp:320] Received status update TASK_FAILED (UUID: 086a001e-387f-4e38-bcf9-b2c6595cd218) for task myservice.gitaefa9d79.config5e0a0677.e9416045-84ed-11e6-8d4d-02423c827edc of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000
I0927 13:07:19.769268 28964 status_update_manager.cpp:825] Checkpointing UPDATE for status update TASK_FAILED (UUID: 086a001e-387f-4e38-bcf9-b2c6595cd218) for task myservice.gitaefa9d79.config5e0a0677.e9416045-84ed-11e6-8d4d-02423c827edc of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000
I0927 13:07:19.832613 28967 slave.cpp:3604] Forwarding the update TASK_FAILED (UUID: 086a001e-387f-4e38-bcf9-b2c6595cd218) for task myservice.gitaefa9d79.config5e0a0677.e9416045-84ed-11e6-8d4d-02423c827edc of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 to master@10.44.7.101:5050
I0927 13:07:19.833287 28967 slave.cpp:3514] Sending acknowledgement for status update TASK_FAILED (UUID: 086a001e-387f-4e38-bcf9-b2c6595cd218) for task myservice.gitaefa9d79.config5e0a0677.e9416045-84ed-11e6-8d4d-02423c827edc of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 to executor(1)@10.46.1.35:40844
I0927 13:07:19.953836 28961 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 086a001e-387f-4e38-bcf9-b2c6595cd218) for task myservice.gitaefa9d79.config5e0a0677.e9416045-84ed-11e6-8d4d-02423c827edc of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000
I0927 13:07:19.954156 28961 status_update_manager.cpp:825] Checkpointing ACK for status update TASK_FAILED (UUID: 086a001e-387f-4e38-bcf9-b2c6595cd218) for task myservice.gitaefa9d79.config5e0a0677.e9416045-84ed-11e6-8d4d-02423c827edc of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000
I0927 13:07:21.580893 28972 http.cpp:270] HTTP GET for /slave(1)/state from 10.46.1.35:33638 with User-Agent='python-requests/2.6.2 CPython/2.7.6 Linux/4.2.0-42-generic'
I0927 13:07:22.617918 28968 slave.cpp:3732] executor(1)@10.46.1.35:40844 exited
I0927 13:07:22.690866 28961 docker.cpp:2132] Executor for container '1d027a1b-e390-473d-9906-d22573810aa2' has exited
I0927 13:07:22.690968 28961 docker.cpp:1852] Destroying container '1d027a1b-e390-473d-9906-d22573810aa2'
I0927 13:07:22.691110 28961 docker.cpp:1980] Running docker stop on container '1d027a1b-e390-473d-9906-d22573810aa2'
I0927 13:07:22.696202 28950 slave.cpp:4089] Executor 'myservice.gitaefa9d79.config5e0a0677.e9416045-84ed-11e6-8d4d-02423c827edc' of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 exited with status 0
I0927 13:07:22.696298 28950 slave.cpp:4193] Cleaning up executor 'myservice.gitaefa9d79.config5e0a0677.e9416045-84ed-11e6-8d4d-02423c827edc' of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 at executor(1)@10.46.1.35:40844
{noformat}
Note that the executor terminated at 13:07:22. However, the executor log shows that the executor still hung around until it reached maxConsecutiveFailures (300 seconds in total) on top of the health check grace period (60 seconds).
{noformat}
I0927 13:06:46.146533 17981 health_checker.hpp:119] Ignoring failure as health check still in grace period
Running with pid 17.
Running with pid 17.
Running with pid 17.
Error response from daemon: Container mesos-8d673c2a-08ac-4074-b6c0-66f5a879b12c-S3.1d027a1b-e390-473d-9906-d22573810aa2 is not running
I0927 13:07:27.109156 17976 health_checker.hpp:119] Ignoring failure as health check still in grace period
Error response from daemon: Container mesos-8d673c2a-08ac-4074-b6c0-66f5a879b12c-S3.1d027a1b-e390-473d-9906-d22573810aa2 is not running
I0927 13:07:37.174942 17976 health_checker.hpp:119] Ignoring failure as health check still in grace period
Error response from daemon: Container mesos-8d673c2a-08ac-4074-b6c0-66f5a879b12c-S3.1d027a1b-e390-473d-9906-d22573810aa2 is not running
E0927 13:07:47.241386 17987 process.cpp:2105] Failed to shutdown socket with fd 6: Transport endpoint is not connected
Error response from daemon: Container mesos-8d673c2a-08ac-4074-b6c0-66f5a879b12c-S3.1d027a1b-e390-473d-9906-d22573810aa2 is not running
E0927 13:07:57.303136 17987 process.cpp:2105] Failed to shutdown socket with fd 6: Transport endpoint is not connected
REPEAT...
E0927 13:12:39.092195 17987 process.cpp:2105] Failed to shutdown socket with fd 6: Transport endpoint is not connected
W0927 13:12:40.092910 17962 main.cpp:171] Health check failed Health command check exited with status 1
END
{noformat}
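For reference, the roughly five-minute gap between the container being destroyed (13:07:22) and the executor finally logging the failed health check (13:12:40) is what a grace period plus repeated ~10-second checks would produce. A minimal sketch of that arithmetic, with assumed illustrative values for gracePeriodSeconds, intervalSeconds and maxConsecutiveFailures (the actual app definition is not shown in this ticket):
{noformat}
object HealthCheckDelaySketch {
  // Assumed, illustrative values -- the real ones live in the Marathon app's
  // healthCheck definition, which is not part of this ticket.
  val gracePeriodSeconds = 60        // failures ignored: "still in grace period"
  val intervalSeconds = 10           // checks are ~10s apart in the executor log
  val maxConsecutiveFailures = 30    // assumed: 30 * 10s matches the ~300s observed

  def main(args: Array[String]): Unit = {
    val delaySeconds = gracePeriodSeconds + intervalSeconds * maxConsecutiveFailures
    // Prints 360: the executor can keep running its docker health check for
    // roughly this long after the container is already gone.
    println(delaySeconds)
  }
}
{noformat}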
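The chronos-side crash quoted in the original issue below (scala.MatchError thrown from TaskUtils.parseTaskId when the misrouted marathon task id reaches chronos's statusUpdate) is the usual failure mode of a Scala pattern match with no fallback case. A minimal sketch of that mechanism, assuming a hypothetical chronos-style task-id format (the real parsing code in chronos differs):
{noformat}
object ParseTaskIdSketch {
  // Hypothetical chronos-style id: "ct:<dueMillis>:<attempt>:<jobName>".
  // The actual format and regex used by chronos's TaskUtils differ.
  private val ChronosId = """ct:(\d+):(\d+):(.+)""".r

  def parseTaskId(taskId: String): (String, Long, Int) = taskId match {
    case ChronosId(due, attempt, job) => (job, due.toLong, attempt.toInt)
    // No catch-all case: a marathon-style id such as
    // "myservice.gitaefa9d79.config5e0a0677.e9416045-..." does not match,
    // so the match expression throws scala.MatchError and the scheduler
    // driver ends up aborted.
  }

  def main(args: Array[String]): Unit = {
    println(parseTaskId("ct:1449475107000:0:somejob"))   // parses fine
    println(parseTaskId("myservice.e9416045-84ed-11e6")) // throws scala.MatchError
  }
}
{noformat}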
> mesos-slave assigned marathon task wrongly to chronos framework after task failure
> ----------------------------------------------------------------------------------
>
>                 Key: MESOS-4084
>                 URL: https://issues.apache.org/jira/browse/MESOS-4084
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.22.2
>        Environment: Ubuntu 14.04.2 LTS
> Mesos 0.22.2
> Marathon 0.11.0
> Chronos 2.4.0
>            Reporter: Erhan Kesken
>            Priority: Minor
>
> I don't know how to reproduce the problem; the only thing I can do is share my logs:
> https://gist.github.com/ekesken/f2edfd65cca8638b0136
> These are highlights from my logs:
> mesos-slave logs:
> {noformat}
> Dec 7 08:58:27 mesos-slave-node-012 mesos-slave[56099]: I1207 08:58:27.089156 56130 slave.cpp:2531] Handling status update TASK_FAILED (UUID: 5b335fab-1722-4270-83a6-b4ec843be47f) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 of framework 20151113-112010-100670892-5050-7957-0001 from executor(1)@172.29.1.12:1651
> 08:58:27 mesos-slave-node-012 mesos-slave[56099]: E1207 08:58:27.089874 56128 slave.cpp:2662] Failed to update resources for container ed5f4f67-464d-4786-9628-cd48732de6b7 of executor collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 running task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/34074/cgroup: Failed to open file '/proc/34074/cgroup': No such file or directory
> {noformat}
> Notice the framework id above; 5 minutes later we got the following logs:
> {noformat}
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 09:03:27.653187 56130 slave.cpp:2531] Handling status update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 in health state unhealthy of framework 20150624-210230-117448108-5050-3678-0001 from executor(1)@172.29.1.12:1651
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: W1207 09:03:27.653282 56130 slave.cpp:2568] Could not find the executor for status update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 in health state unhealthy of framework 20150624-210230-117448108-5050-3678-0001
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 09:03:27.653390 56130 status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 in health state unhealthy of framework 20150624-210230-117448108-5050-3678-0001
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 09:03:27.653543 56130 slave.cpp:2776] Forwarding the update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 in health state unhealthy of framework 20150624-210230-117448108-5050-3678-0001 to master@172.29.0.5:5050
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 09:03:27.653688 56130 slave.cpp:2709] Sending acknowledgement for status update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 in health state unhealthy of framework 20150624-210230-117448108-5050-3678-0001 to executor(1)@172.29.1.12:1651
> Dec 7 09:03:37 mesos-slave-node-012 mesos-slave[56099]: W1207 09:03:37.654337 56134 status_update_manager.cpp:472] Resending status update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 in health state unhealthy of framework 20150624-210230-117448108-5050-3678-0001
> {noformat}
> This caused immediate deactivation of chronos, as seen in the mesos-master log:
> {noformat}
> Dec 7 09:03:27 mesos-master-node-001 mesos-master[40898]: I1207 09:03:27.654770 40948 master.cpp:1964] Deactivating framework 20150624-210230-117448108-5050-3678-0001 (chronos-2.4.0) at scheduler-7a4396f7-1f68-4f41-901e-805db5de0432@172.29.0.6:11893
> Dec 7 09:03:27 mesos-master-node-001 mesos-master[40898]: I1207 09:03:27.654989 40948 hierarchical.hpp:400] Deactivated framework 20150624-210230-117448108-5050-3678-0001
> {noformat}
> There was no clue in the chronos log at that time, but when we restarted the chronos master, we got these logs:
> {noformat}
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: Exception in thread "Thread-62550" scala.MatchError: collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 (of class java.lang.String)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at org.apache.mesos.chronos.scheduler.jobs.TaskUtils$.parseTaskId(TaskUtils.scala:132)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework.statusUpdate(MesosJobFramework.scala:215)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at java.lang.reflect.Method.invoke(Method.java:497)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at com.google.inject.internal.DelegatingInvocationHandler.invoke(DelegatingInvocationHandler.java:37)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at com.sun.proxy.$Proxy31.statusUpdate(Unknown Source)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: I1207 12:35:25.238782 15035 sched.cpp:1623] Asked to abort the driver
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: I1207 12:35:25.238836 15035 sched.cpp:856] Aborting framework '20150624-210230-117448108-5050-3678-0001'
> {noformat}
> The problem was solved by running the following commands on the problematic slave:
> {noformat}
> service mesos-slave stop
> rm -fr /tmp/mesos
> service mesos-slave start
> {noformat}
> I needed to restart the chronos master to reactivate the framework; I don't know if there is a way to reactivate a framework manually rather than restarting the master node.
> I guess this issue https://github.com/mesos/chronos/issues/569 opened in the chronos project is related to this bug too.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)