mesos-issues mailing list archives

From "Ian Babrou (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-6118) Agent would crash with docker container tasks due to host mount table read.
Date Fri, 23 Sep 2016 11:31:20 GMT

    [ https://issues.apache.org/jira/browse/MESOS-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15516181#comment-15516181
] 

Ian Babrou commented on MESOS-6118:
-----------------------------------

I'm also experiencing this issue:

{noformat}
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.763520  4995 slave.cpp:3211] Handling
status update TASK_FAILED (UUID: 084ace64-a1bf-495d-9769-ad831b53d1bf) for task pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1
of framework 20150606-001827-252388362-5050-5982-0001 from executor(1)@10.10.23.25:46833
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.763664  4991 slave.cpp:6014] Terminating
task pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.763825  5002 docker.cpp:972] Running
docker -H unix:///var/run/docker.sock inspect mesos-dfc1b04b-941b-4d93-adf4-c65ab307ee2c-S2.c40cea8c-31a9-468f-a183-ed9851cd5aa8
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.821267  4987 status_update_manager.cpp:320]
Received status update TASK_FAILED (UUID: 084ace64-a1bf-495d-9769-ad831b53d1bf) for task pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1
of framework 20150606-001827-252388362-5050-5982-0001
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.821296  4987 status_update_manager.cpp:825]
Checkpointing UPDATE for status update TASK_FAILED (UUID: 084ace64-a1bf-495d-9769-ad831b53d1bf)
for task pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1 of framework 20150606-001827-252388362-5050-5982-0001
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.844871  4987 status_update_manager.cpp:374]
Forwarding update TASK_FAILED (UUID: 084ace64-a1bf-495d-9769-ad831b53d1bf) for task pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1
of framework 20150606-001827-252388362-5050-5982-0001 to the agent
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.844970  5009 slave.cpp:3604] Forwarding
the update TASK_FAILED (UUID: 084ace64-a1bf-495d-9769-ad831b53d1bf) for task pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1
of framework 20150606-001827-252388362-5050-5982-0001 to master@10.10.11.16:5050
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.845062  5009 slave.cpp:3498] Status
update manager successfully handled status update TASK_FAILED (UUID: 084ace64-a1bf-495d-9769-ad831b53d1bf)
for task pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1 of framework 20150606-001827-252388362-5050-5982-0001
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.845074  5009 slave.cpp:3514] Sending
acknowledgement for status update TASK_FAILED (UUID: 084ace64-a1bf-495d-9769-ad831b53d1bf)
for task pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1 of framework 20150606-001827-252388362-5050-5982-0001
to executor(1)@10.10.23.25:46833
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.864859  4987 slave.cpp:3686] Received
ping from slave-observer(149)@10.10.11.16:5050
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.955936  4995 status_update_manager.cpp:392]
Received status update acknowledgement (UUID: 084ace64-a1bf-495d-9769-ad831b53d1bf) for task
pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1 of framework 20150606-001827-252388362-5050-5982-0001
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.956001  4995 status_update_manager.cpp:825]
Checkpointing ACK for status update TASK_FAILED (UUID: 084ace64-a1bf-495d-9769-ad831b53d1bf)
for task pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1 of framework 20150606-001827-252388362-5050-5982-0001
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.982950  4995 status_update_manager.cpp:528]
Cleaning up status update stream for task pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1
of framework 20150606-001827-252388362-5050-5982-0001
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.983119  4995 slave.cpp:2597] Status
update manager successfully handled status update acknowledgement (UUID: 084ace64-a1bf-495d-9769-ad831b53d1bf)
for task pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1 of framework 20150606-001827-252388362-5050-5982-0001
Sep 23 11:07:39 myhost mesos-agent[4980]: I0923 11:07:39.983131  4995 slave.cpp:6055] Completing
task pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1
Sep 23 11:07:40 myhost mesos-agent[4980]: I0923 11:07:40.667191  4981 process.cpp:3323] Handling
HTTP event for process 'slave(1)' with path: '/slave(1)/state'
Sep 23 11:07:40 myhost mesos-agent[4980]: I0923 11:07:40.667413  4983 http.cpp:270] HTTP GET
for /slave(1)/state from 10.10.19.24:33570 with User-Agent='Go-http-client/1.1'
Sep 23 11:07:40 myhost mesos-agent[4980]: I0923 11:07:40.669677  5012 process.cpp:3323] Handling
HTTP event for process 'files' with path: '/files/download'
Sep 23 11:07:40 myhost mesos-agent[4980]: I0923 11:07:40.670250  5005 process.cpp:1280] Sending
file at '/state/var/lib/mesos/slaves/dfc1b04b-941b-4d93-adf4-c65ab307ee2c-S2/frameworks/20150606-001827-252388362-5050-5982-0001/executors/pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1/runs/c40cea8c-31a9-468f-a183-ed9851cd5aa8/stdout'
with length 1335
Sep 23 11:07:40 myhost mesos-agent[4980]: I0923 11:07:40.765249  5008 slave.cpp:3732] executor(1)@10.10.23.25:46833
exited
Sep 23 11:07:40 myhost mesos-agent[4980]: I0923 11:07:40.783426  4982 process.cpp:3323] Handling
HTTP event for process 'files' with path: '/files/download'
Sep 23 11:07:40 myhost mesos-agent[4980]: I0923 11:07:40.783843  4995 process.cpp:1280] Sending
file at '/state/var/lib/mesos/slaves/dfc1b04b-941b-4d93-adf4-c65ab307ee2c-S2/frameworks/20150606-001827-252388362-5050-5982-0001/executors/pdx_phoenix.e7b89f12-817d-11e6-9c3a-2c600cbc2dd1/runs/c40cea8c-31a9-468f-a183-ed9851cd5aa8/stderr'
with length 3543
Sep 23 11:07:40 myhost mesos-agent[4980]: I0923 11:07:40.826164  5000 docker.cpp:2132] Executor
for container 'c40cea8c-31a9-468f-a183-ed9851cd5aa8' has exited
Sep 23 11:07:40 myhost mesos-agent[4980]: I0923 11:07:40.826181  5000 docker.cpp:1852] Destroying
container 'c40cea8c-31a9-468f-a183-ed9851cd5aa8'
Sep 23 11:07:40 myhost mesos-agent[4980]: I0923 11:07:40.826207  5000 docker.cpp:1980] Running
docker stop on container 'c40cea8c-31a9-468f-a183-ed9851cd5aa8'
Sep 23 11:07:40 myhost mesos-agent[4980]: F0923 11:07:40.826529  5000 fs.cpp:140] Check failed:
!visitedParents.contains(parentId)
Sep 23 11:07:40 myhost mesos-agent[4980]: *** Check failure stack trace: ***
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50dd98953d  google::LogMessage::Fail()
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50dd98b1bd  google::LogMessage::SendToLog()
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50dd989102  google::LogMessage::Flush()
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50dd98bba9  google::LogMessageFatal::~LogMessageFatal()
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50dd45883d  _ZNSt17_Function_handlerIFviEZN5mesos8internal2fs14MountInfoTable4readERK6OptionIiEbEUliE_E9_M_invokeERKSt9_Any_datai
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50dd4587a5  _ZNSt17_Function_handlerIFviEZN5mesos8internal2fs14MountInfoTable4readERK6OptionIiEbEUliE_E9_M_invokeERKSt9_Any_datai
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50dd45fc5a  mesos::internal::fs::MountInfoTable::read()
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50dd213346  mesos::internal::slave::DockerContainerizerProcess::unmountPersistentVolumes()
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50dd22f157  mesos::internal::slave::DockerContainerizerProcess::___destroy()
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50dd92d094  process::ProcessManager::resume()
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50dd92d3b7  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50dc007970  (unknown)
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50dbb260a4  start_thread
Sep 23 11:07:40 myhost mesos-agent[4980]: @     0x7f50db85b87d  (unknown)
Sep 23 11:07:40 myhost systemd[1]: mesos-agent.service: main process exited, code=killed,
status=6/ABRT
Sep 23 11:07:40 myhost systemd[1]: Unit mesos-agent.service entered failed state.
{noformat}
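For context, the fatal line {{Check failed: !visitedParents.contains(parentId)}} comes from the mount-table sort in {{fs.cpp}}: {{MountInfoTable::read()}} walks each entry up its parent chain so parents are ordered before children, and the CHECK aborts if that walk revisits a parent, i.e. the snapshot of {{/proc/self/mountinfo}} contains a cycle. Below is a minimal sketch of that invariant (not Mesos code; the function name and the parent-map representation are illustrative):

```python
def check_parent_chains(parents):
    """parents: mount ID -> parent mount ID, as read from the first two
    fields of /proc/self/mountinfo. The root's parent is absent from the
    table. Raises if any parent chain contains a cycle -- the condition
    the agent's CHECK turns into an abort."""
    for mount_id in parents:
        visited = set()
        current = mount_id
        while current in parents:  # stop once we step past the root
            if current in visited:
                raise RuntimeError("cycle in mount table at ID %d" % current)
            visited.add(current)
            current = parents[current]

# A consistent snapshot walks cleanly:
check_parent_chains({22: 1, 23: 22, 24: 22})

# A snapshot where IDs were reused mid-read can form a cycle:
try:
    check_parent_chains({22: 23, 23: 22})
except RuntimeError as e:
    print(e)  # cycle in mount table at ID 22
```

If Docker is tearing containers down while the table is being read (mount IDs get reused by the kernel), the agent can observe exactly such an inconsistent snapshot, which would explain why this only shows up under heavy task churn.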

> Agent would crash with docker container tasks due to host mount table read.
> ---------------------------------------------------------------------------
>
>                 Key: MESOS-6118
>                 URL: https://issues.apache.org/jira/browse/MESOS-6118
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 1.0.1
>         Environment: Build: 2016-08-26 23:06:27 by centos
> Version: 1.0.1
> Git tag: 1.0.1
> Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
> systemd version `219` detected
> Inializing systemd state
> Created systemd slice: `/run/systemd/system/mesos_executors.slice`
> Started systemd slice `mesos_executors.slice`
> Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
>  Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> Linux ip-10-254-192-40 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 19:05:49 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Jamie Briant
>            Assignee: Kevin Klues
>            Priority: Critical
>              Labels: linux, slave
>             Fix For: 1.1.0, 1.0.2
>
>         Attachments: crashlogfull.log, cycle2.log, cycle3.log, cycle5.log, cycle6.log,
slave-crash.log
>
>
> I have a framework which schedules thousands of short-running tasks (a few seconds to a few
minutes each) over a period of several minutes. In 1.0.1, the slave process will crash
every few minutes (with systemd restarting it).
> Crash is:
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: F0901 20:52:23.905678  1232 fs.cpp:140]
Check failed: !visitedParents.contains(parentId)
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: *** Check failure stack trace: ***
> Version 1.0.0 works without this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
