mesos-issues mailing list archives

From "Vinod Kone (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-6274) Agent should not allow an executor to re-subscribe before containerizer recovery is done.
Date Thu, 29 Sep 2016 05:31:20 GMT

    [ https://issues.apache.org/jira/browse/MESOS-6274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15531847#comment-15531847
] 

Vinod Kone commented on MESOS-6274:
-----------------------------------

Agreed. We should only accept requests in the `Slave::Http::executor` handler after containerizer
recovery is done.

Possible solutions:

--> Set up the `route(/api/v1/executor)` handler in `Slave::_recover()`. This is a bit hacky?
It will also result in an "HTTP::NotFound" response, which is not ideal (ideal is ServiceUnavailable).

--> Split the RECOVERING state into two states, one before containerizer recovery and one
after. This is probably the right solution, but it might require updating a bunch of slave methods
that look at the state enum.

--> Instead of a new state, add a new boolean that is set inside `Slave::_recover()`. Reject
executor API requests with ServiceUnavailable if the boolean is not set.
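
A rough sketch of the boolean option, just to make the idea concrete. This is not the actual
Mesos code: `Agent`, `Response`, `containerizerRecovered()` and the status codes used here are
hypothetical stand-ins for illustration, and the real handler would go through the libprocess
HTTP types instead.

```cpp
#include <string>

// Hypothetical stand-in for an HTTP response; not the libprocess type.
struct Response {
  int status;
  std::string body;
};

// Hypothetical, heavily simplified agent. Only the recovery gate is modeled.
class Agent {
public:
  // Assumed to be called once containerizer recovery has finished, which is
  // the point that `Slave::_recover()` marks in the discussion above.
  void containerizerRecovered() { recovered_ = true; }

  // Handler for executor API requests. Until recovery is done, every request
  // is rejected with 503 (Service Unavailable) instead of being processed.
  Response executor(const std::string& call) const {
    if (!recovered_) {
      return {503, "Agent has not finished recovery"};
    }
    return {202, "Accepted " + call};
  }

private:
  // The new boolean: false from startup until containerizer recovery is done.
  bool recovered_ = false;
};
```

With this shape, an executor that re-subscribes too early gets a 503 and can retry later,
rather than being accepted before the containerizer knows about its container.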

Other options?

> Agent should not allow an executor to re-subscribe before containerizer recovery is done.
> -----------------------------------------------------------------------------------------
>
>                 Key: MESOS-6274
>                 URL: https://issues.apache.org/jira/browse/MESOS-6274
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Jie Yu
>            Priority: Blocker
>
> In the old API, the agent sends a reconnect request to the executor, and the executor then
> registers with the agent.
> Now, in the new API, the agent allows an executor to re-subscribe before containerizer
> recovery is done. This is problematic because the containerizer does not know about the
> containers yet, so calling containerizer->update fails, which causes the container to be
> killed.
> {noformat}
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693418 22646 containerizer.cpp:580] Recovering containerizer
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693444 22646 containerizer.cpp:636] Recovering container 568968cc-f41c-475a-bb2b-45d8babd853d for executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693445 22645 http.cpp:273] HTTP POST for /agent/api/v1/executor from 172.30.2.198:42683
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693567 22645 slave.cpp:3017] Received Subscribe request for HTTP executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000 (via HTTP)
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693613 22645 slave.cpp:3080] Creating a marker file for HTTP based executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000 (via HTTP) at path '/mnt/teamcity/temp/buildTmp/SlaveRecoveryTest_0_ROOT_CGROUPS_ReconnectDefaultExecutor_XpQvvJ/meta/slaves/7e4c8518-cb45-4b09-9fa8-c029d56289e2-S0/frameworks/7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000/executors/default/runs/568968cc-f41c-475a-bb2b-45d8babd853d/http.marker'
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693733 22645 slave.cpp:3609] Handling status update TASK_RUNNING (UUID: 6cc3f9a7-d020-46f0-82c1-39fbb9d43786) for task db1f9b1b-75d2-4d96-831f-48d6f28301e8 of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693801 22645 slave.cpp:3609] Handling status update TASK_RUNNING (UUID: f80d217b-7844-4134-8cc8-db6998ac437e) for task 3a583cbb-8ea9-440a-864d-e68a23472368 of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:	 [Step 10/10] E0929 04:04:11.694232 22648 slave.cpp:2055] Failed to update resources for container 568968cc-f41c-475a-bb2b-45d8babd853d of executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000, destroying container: Collect failed: Unknown container
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
