mesos-issues mailing list archives

From "Scott D.W. Rankin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
Date Mon, 07 Mar 2016 14:10:40 GMT

    [ https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183035#comment-15183035 ]

Scott D.W. Rankin commented on MESOS-2684:
------------------------------------------

Interesting that you should say that. We did have tmpwatch running on our servers, and we
turned it off because it seemed to be the root cause of the issue.  However, we are
still seeing the crash on our new servers (CentOS 7 + Mesos 0.23 + Marathon 0.11.1), and AFAIK
nothing is touching /tmp.

The more frustrating thing is that after the slave crashes, it cannot seem to recover the
other executors running on that server, so the whole slave fails.

> mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
> --------------------------------------------------------------------------
>
>                 Key: MESOS-2684
>                 URL: https://issues.apache.org/jira/browse/MESOS-2684
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker, slave
>    Affects Versions: 0.21.1
>            Reporter: Steven Schlansker
>         Attachments: mesos-slave-restart.txt
>
>
> mesos-slave can encounter a variety of problems while attempting to launch a task.  If
> the task fails, that is unfortunate, but not the end of the world.  Other tasks should not
> be affected.
> However, if the task failure happens to trigger an assertion, the entire slave comes
> crashing down:
> F0501 19:10:46.095464  1705 paths.hpp:342] CHECK_SOME(mkdir): No space left on device
> Failed to create executor directory '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'
> Immediately afterwards, all tasks on this slave were declared TASK_KILLED when mesos-slave
> restarted.
> Something as simple as a 'mkdir' failing is not worthy of an assertion failure.
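
To illustrate the behaviour the description asks for, here is a minimal sketch in Python (not Mesos code, which is C++). It assumes a launcher that creates one directory per task and, when the directory cannot be created, reports that task as failed rather than asserting. LaunchResult, launch_task, demo and the directory layout are hypothetical names made up for this illustration; only the TASK_RUNNING and TASK_FAILED state names come from Mesos.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class LaunchResult:
    """Outcome of one launch attempt (hypothetical type, not a Mesos API)."""
    task_id: str
    state: str          # "TASK_RUNNING" or "TASK_FAILED"
    message: str = ""


def launch_task(task_id: str, work_dir: str) -> LaunchResult:
    """Create the task's directory; report a failure instead of aborting."""
    directory = os.path.join(work_dir, "executors", task_id, "runs", "latest")
    try:
        os.makedirs(directory, exist_ok=True)
    except OSError as error:
        # Any mkdir failure (e.g. ENOSPC, "No space left on device") is
        # confined to this task; the caller keeps running.
        return LaunchResult(
            task_id,
            "TASK_FAILED",
            f"Failed to create executor directory '{directory}': {error.strerror}",
        )
    return LaunchResult(task_id, "TASK_RUNNING")


def demo() -> list:
    """Launch three tasks, one of which is forced to hit a mkdir failure."""
    work_dir = tempfile.mkdtemp()
    os.makedirs(os.path.join(work_dir, "executors"))
    # Put a regular file where the "bad" task's directory should go, so
    # that creating anything beneath it fails with an OSError.
    open(os.path.join(work_dir, "executors", "bad"), "w").close()
    return [launch_task(task, work_dir) for task in ("good-1", "bad", "good-2")]


if __name__ == "__main__":
    for result in demo():
        print(result.task_id, result.state, result.message)
```

In this sketch the failure is forced with a file in place of a directory rather than a full disk, but the handling is the same for any OSError: the "bad" task ends up TASK_FAILED while the tasks launched before and after it are unaffected.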



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
