mesos-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jie Yu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data
Date Thu, 10 May 2018 17:22:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470780#comment-16470780
] 

Jie Yu commented on MESOS-8830:
-------------------------------

[~zhitao] can you grep for container id `904d8155-e4c3-43e3-bf01-85de6a702149` in the agent
log?

> Agent gc on old slave sandboxes could empty persistent volume data
> ------------------------------------------------------------------
>
>                 Key: MESOS-8830
>                 URL: https://issues.apache.org/jira/browse/MESOS-8830
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.3.1, 1.4.1, 1.5.0
>            Reporter: Zhitao Li
>            Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any container
image thus running on host filesystem) saw its persistent volume data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max allowed age:
4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining removal time
2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining removal time
2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 127.0.0.1:53602
with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44:
Directory not empty
> {panel}
> (I can try to provide more logs, depending on how much local archive after rotation has)
> This happened on a 1.3.1 agent although I suspect it's not local to that version.
> The path */var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume*
is a bind mount to a persistent volume. The fact that agent gc touched that process makes
me believe this is what triggered the data loss.
> We had some misconfigurations on out fleet, and I do not know whether the previous slave
(id-ed *70279b87-553a-4213-a85b-46fdc191849d-S4*) was shutdown cleanly yet.
> My questions/suggestions:
> 1) if an executor was asked to shutdown by a new agent (with a new id), how much of persistent
volume clean up code will be executed (especially if new agent do not really know this executor
should be running anymore)?
> 2) should we figure out a way to better protect bind mounts to persistent volumes in
slave/gc.cpp (for instance, skip any path which seems dangerous), to prevent similar problems?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message