mesos-issues mailing list archives

From "Marco Massenzio (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-2215) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.
Date Fri, 15 May 2015 15:36:00 GMT

     [ https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Massenzio updated MESOS-2215:
-----------------------------------
    Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 10 - 5/30  (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

> The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-2215
>                 URL: https://issues.apache.org/jira/browse/MESOS-2215
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker
>    Affects Versions: 0.21.0
>            Reporter: Steve Niemitz
>            Assignee: Timothy Chen
>
> Once the slave restarts and recovers its tasks, I see this error in the log every second or so for every task that was recovered. Note, these were NOT docker tasks:
> W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage for  container
7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
of framework 20150109-161713-715350282-5050-290797-0000: Failed to 'docker inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21':
exit status = exited with status 1 stderr = Error: No such image or container: mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
> However, the tasks themselves are still healthy and running.
> The slave was launched with --containerizers=mesos,docker
> -----
> More info: it looks like the docker containerizer is a little too ambitious about recovering containers; again, this was not a docker task:
> I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container '7b729b89-dc7e-4d08-af97-8cd1af560a21'
for executor 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
of framework 20150109-161713-715350282-5050-290797-0000
> Looking into the source, it looks like the problem is that the ComposingContainerizer runs recover on all of its containerizers in parallel, but neither the docker containerizer nor the mesos containerizer checks whether it should recover a given task (i.e., whether it was the one that launched it). Perhaps this needs to be written into the checkpoint somewhere?
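
As a rough illustration of that last suggestion, here is a minimal, self-contained C++ sketch. This is not actual Mesos code: all names in it (CheckpointedContainer, SketchContainerizer, the launchedBy field) are hypothetical. The idea is to checkpoint which containerizer launched each container, so that during recovery each containerizer skips containers it did not launch.

// Minimal sketch (hypothetical types, not Mesos APIs): each containerizer
// only recovers containers whose checkpointed "launchedBy" matches itself.

#include <iostream>
#include <string>
#include <vector>

enum class ContainerizerType { MESOS, DOCKER };

// Stand-in for per-container state checkpointed by the slave.
struct CheckpointedContainer {
  std::string containerId;
  ContainerizerType launchedBy;  // The piece of state the report says is missing.
};

class SketchContainerizer {
 public:
  explicit SketchContainerizer(ContainerizerType type) : type_(type) {}

  // Recover only the containers this containerizer launched.
  void recover(const std::vector<CheckpointedContainer>& checkpoint) const {
    for (const CheckpointedContainer& container : checkpoint) {
      if (container.launchedBy != type_) {
        // Without this check, the docker containerizer would try to
        // "docker inspect" mesos-launched containers and log the
        // "No such image or container" warning above, over and over.
        continue;
      }
      std::cout << "Recovering container '" << container.containerId << "' with "
                << (type_ == ContainerizerType::DOCKER ? "docker" : "mesos")
                << " containerizer" << std::endl;
    }
  }

 private:
  ContainerizerType type_;
};

int main() {
  // Two checkpointed containers, one launched by each containerizer
  // (the second id is made up for the example).
  std::vector<CheckpointedContainer> checkpoint = {
      {"7b729b89-dc7e-4d08-af97-8cd1af560a21", ContainerizerType::MESOS},
      {"example-docker-container-id", ContainerizerType::DOCKER},
  };

  // The composing containerizer calls recover on both containerizers;
  // with the ownership check, each one only touches its own containers.
  SketchContainerizer(ContainerizerType::MESOS).recover(checkpoint);
  SketchContainerizer(ContainerizerType::DOCKER).recover(checkpoint);

  return 0;
}

With something along these lines, a slave started with --containerizers=mesos,docker would let each containerizer filter the checkpointed containers instead of both attempting to recover everything.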



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
