mesos-issues mailing list archives

From "Anand Mazumdar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-6274) Agent should not allow HTTP executors to re-subscribe before containerizer recovery is done.
Date Thu, 29 Sep 2016 23:50:20 GMT

    [ https://issues.apache.org/jira/browse/MESOS-6274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534444#comment-15534444 ]

Anand Mazumdar commented on MESOS-6274:
---------------------------------------

{noformat}
commit 914ab0f640377cfed9cc8a9dabfa40adec500c0e
Author: Anand Mazumdar <anand@apache.org>
Date:   Thu Sep 29 16:38:07 2016 -0700

    Disallowed HTTP executors to subscribe before containerizer recovery.

    Previously, it was possible for an HTTP-based executor to subscribe
    with the agent before containerizer recovery was done. This
    was a problem since calling `containerizer->update()` etc. would
    result in a failure.

    Review: https://reviews.apache.org/r/52408/

commit 6b99555fa808eb32e32c3624704d0971568ca795
Author: Anand Mazumdar <anand@apache.org>
Date:   Thu Sep 29 16:37:53 2016 -0700

    Added `RecoveryInfo` struct to the agent.

    This struct will contain all the recovery-related metadata on
    the agent from now on. Eventually, we will add component-specific
    recovery information to this struct, e.g., whether the executors
    can now subscribe again with the agent.

    Review: https://reviews.apache.org/r/52407/
{noformat}

Keeping the JIRA open until the backports are done.
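
The gating described in the two commits above can be sketched roughly as follows. This is a minimal, self-contained model and not the Mesos source: the `reconnect` field, the `Agent` class, `handleSubscribe()`, `recovered()`, the `Response` type, and the 503/200 status codes are all assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the recovery metadata the commit describes.
// The field name is an assumption, not taken from the Mesos source.
struct RecoveryInfo
{
  // Set once containerizer recovery has finished and executors may
  // subscribe again.
  bool reconnect = false;
};

// Toy HTTP-style response; the status codes are illustrative.
struct Response
{
  int status;
  std::string body;
};

class Agent
{
public:
  // Called when containerizer recovery completes.
  void recovered() { recoveryInfo.reconnect = true; }

  // Rejects Subscribe calls until recovery is done, so nothing downstream
  // asks the containerizer about containers it has not recovered yet.
  Response handleSubscribe(const std::string& executorId) const
  {
    if (!recoveryInfo.reconnect) {
      return {503, "Agent has not finished recovery"};
    }
    return {200, "Subscribed executor '" + executorId + "'"};
  }

private:
  RecoveryInfo recoveryInfo;
};

// Usage: a subscribe that arrives early is refused and can be retried.
inline void example()
{
  Agent agent;
  assert(agent.handleSubscribe("default").status == 503);
  agent.recovered();
  assert(agent.handleSubscribe("default").status == 200);
}
```

In this sketch the early request is refused rather than queued, which leaves the retry to the executor.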

> Agent should not allow HTTP executors to re-subscribe before containerizer recovery is done.
> --------------------------------------------------------------------------------------------
>
>                 Key: MESOS-6274
>                 URL: https://issues.apache.org/jira/browse/MESOS-6274
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Jie Yu
>            Assignee: Anand Mazumdar
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 1.1.0, 1.0.2
>
>
> In the old API, the agent sends a reconnect request to the executor, and the executor then registers with the agent.
> Now, in the new API, the agent allows an executor to re-subscribe before containerizer recovery is done. This is problematic because the containerizer does not know about the containers yet, so calling containerizer->update fails, which causes the container to be killed.
> {noformat}
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693418 22646 containerizer.cpp:580] Recovering containerizer
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693444 22646 containerizer.cpp:636] Recovering container 568968cc-f41c-475a-bb2b-45d8babd853d for executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693445 22645 http.cpp:273] HTTP POST for /agent/api/v1/executor from 172.30.2.198:42683
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693567 22645 slave.cpp:3017] Received Subscribe request for HTTP executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000 (via HTTP)
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693613 22645 slave.cpp:3080] Creating a marker file for HTTP based executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000 (via HTTP) at path '/mnt/teamcity/temp/buildTmp/SlaveRecoveryTest_0_ROOT_CGROUPS_ReconnectDefaultExecutor_XpQvvJ/meta/slaves/7e4c8518-cb45-4b09-9fa8-c029d56289e2-S0/frameworks/7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000/executors/default/runs/568968cc-f41c-475a-bb2b-45d8babd853d/http.marker'
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693733 22645 slave.cpp:3609] Handling status update TASK_RUNNING (UUID: 6cc3f9a7-d020-46f0-82c1-39fbb9d43786) for task db1f9b1b-75d2-4d96-831f-48d6f28301e8 of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693801 22645 slave.cpp:3609] Handling status update TASK_RUNNING (UUID: f80d217b-7844-4134-8cc8-db6998ac437e) for task 3a583cbb-8ea9-440a-864d-e68a23472368 of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:	 [Step 10/10] E0929 04:04:11.694232 22648 slave.cpp:2055] Failed to update resources for container 568968cc-f41c-475a-bb2b-45d8babd853d of executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000, destroying container: Collect failed: Unknown container
> {noformat}
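
The failure in the log above can be modelled with a toy containerizer that only learns about containers once recovery has run. This is a simplified sketch, not the Mesos containerizer interface: the `Containerizer` class, the synchronous `recover()` and `update()` signatures, and the string-based error return are assumptions made for illustration. Only the "Unknown container" message is taken from the log.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy containerizer: it only knows about a container after recover() runs.
class Containerizer
{
public:
  // Rebuilds in-memory state from the checkpointed container IDs.
  void recover(const std::vector<std::string>& checkpointed)
  {
    containers.insert(checkpointed.begin(), checkpointed.end());
  }

  // Returns an empty string on success, or an error message on failure.
  std::string update(const std::string& containerId) const
  {
    if (containers.count(containerId) == 0) {
      return "Unknown container";
    }
    return "";
  }

private:
  std::set<std::string> containers;
};

// Usage: the same update fails before recovery and succeeds after it.
inline void example()
{
  const std::string id = "568968cc-f41c-475a-bb2b-45d8babd853d";
  Containerizer containerizer;
  assert(containerizer.update(id) == "Unknown container");
  containerizer.recover({id});
  assert(containerizer.update(id).empty());
}
```

An executor that subscribes early triggers the first `update()` call in this model, which is the point at which the real agent destroys the container.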



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
