From "Qian Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-8444) Agent miss to detach virtual paths for the executor's sandbox
Date Sun, 14 Jan 2018 09:54:00 GMT

     [ https://issues.apache.org/jira/browse/MESOS-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Qian Zhang updated MESOS-8444:
------------------------------
    Description: 
I launched a task group with a single task via {{mesos-execute}}; the task just ran
{{sleep 10}}. When the task finished, {{Slave::removeExecutor()}} and {{Slave::removeFramework()}}
were called, and they tried to gc 3 directories:
# /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>/executors/<executorID>/runs/<containerID>
# /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>/executors/<executorID>
# /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>

For 1 and 2, the gc code looks like this:
{code}
  garbageCollect(path)
    .then(defer(self(), &Self::detachFile, path));
{code}

Because {{then()}} is used here, the detach only happens when the gc succeeds. The problem
is that the order in which gc deletes 1, 2 and 3 is not guaranteed; in my tests, 3 was
deleted first most of the time. Since 3 is the parent directory of 1 and 2, the gc for
1 and 2 then fails:
{code}
I0111 00:19:33.001655 42889 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000
I0111 00:19:33.002576 42889 gc.cpp:218] Deleted '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000'
I0111 00:19:33.004551 42893 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor/runs/b067936a-f4c4-4091-b786-4dd4d4d6da15
W0111 00:19:33.004622 42893 gc.cpp:212] Failed to delete '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor/runs/b067936a-f4c4-4091-b786-4dd4d4d6da15': No such file or directory
I0111 00:19:33.006367 42923 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor
W0111 00:19:33.006466 42923 gc.cpp:212] Failed to delete '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor': No such file or directory
{code}
So the detach is never done for 1 and 2, which is a leak.
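
The ordering problem can be reproduced with a small toy model. This is only a sketch under stated assumptions: {{ToyAgent}}, {{simulate()}}, {{selfCheck()}} and the path constants are hypothetical names made up for illustration, not Mesos or libprocess APIs, and the {{detachEvenOnFailure}} switch is just one conceivable alternative for comparison, not necessarily the fix that was adopted.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for directories 3, 2 and 1 from the list above.
const std::string kFramework = "frameworks/F";
const std::string kExecutor = "frameworks/F/executors/E";
const std::string kRun = "frameworks/F/executors/E/runs/C";

// Toy model of the agent's state; not a Mesos type.
struct ToyAgent {
  std::set<std::string> onDisk;    // directories that still exist
  std::set<std::string> attached;  // virtual paths that are still attached

  // Deletes `path` and everything beneath it. Returns false if `path` is
  // already gone, which models the "No such file or directory" warning.
  bool gc(const std::string& path) {
    if (onDisk.count(path) == 0) {
      return false;
    }
    for (auto it = onDisk.begin(); it != onDisk.end();) {
      const bool under = (*it == path) || (it->rfind(path + "/", 0) == 0);
      if (under) {
        it = onDisk.erase(it);
      } else {
        ++it;
      }
    }
    return true;
  }

  void detach(const std::string& path) { attached.erase(path); }
};

// Runs gc in `gcOrder` and returns the virtual paths left attached.
// With detachEvenOnFailure == false, the detach is chained on gc success
// only, like the success-only continuation in the snippet above.
std::set<std::string> simulate(const std::vector<std::string>& gcOrder,
                               bool detachEvenOnFailure) {
  ToyAgent agent;
  agent.onDisk = {kFramework, kExecutor, kRun};
  agent.attached = {kExecutor, kRun};  // only 1 and 2 are detached after gc

  for (const std::string& path : gcOrder) {
    const bool ok = agent.gc(path);
    if (path != kFramework && (ok || detachEvenOnFailure)) {
      agent.detach(path);
    }
  }
  return agent.attached;
}

// Documents the outcomes of the three interesting cases.
void selfCheck() {
  // Parent (3) first: gc of 1 and 2 fails, so both paths leak.
  assert(simulate({kFramework, kRun, kExecutor}, false).size() == 2);
  // Children first: every gc succeeds and nothing leaks.
  assert(simulate({kRun, kExecutor, kFramework}, false).empty());
  // Parent first, but detaching regardless of the gc result: no leak.
  assert(simulate({kFramework, kRun, kExecutor}, true).empty());
}
```

In this model the leak depends only on the deletion order: whenever 3 is removed before 1 and 2, a success-only continuation never runs for them.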



> Agent miss to detach virtual paths for the executor's sandbox
> -------------------------------------------------------------
>
>                 Key: MESOS-8444
>                 URL: https://issues.apache.org/jira/browse/MESOS-8444
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Qian Zhang
>            Assignee: Qian Zhang



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
