mesos-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yan Xu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-7964) Heavy-duty GC makes the agent unresponsive
Date Wed, 27 Sep 2017 00:20:00 GMT

     [ https://issues.apache.org/jira/browse/MESOS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Yan Xu updated MESOS-7964:
--------------------------
    Fix Version/s: 1.5.0

> Heavy-duty GC makes the agent unresponsive
> ------------------------------------------
>
>                 Key: MESOS-7964
>                 URL: https://issues.apache.org/jira/browse/MESOS-7964
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.4.0
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>             Fix For: 1.4.1, 1.5.0
>
>
> An agent is observed to performe heavy-duty GC every half an hour:
> {noformat}
> Sep 07 18:15:56 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 18:15:56.900282 16054 slave.cpp:5920] Current disk usage 93.61%. Max allowed age: 0ns
> Sep 07 18:15:56 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 18:15:56.900476 16054 gc.cpp:218] Pruning directories with remaining removal time 1.99022105972148days
> ...
> Sep 07 18:22:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 18:22:08.173645 16050 gc.cpp:178] Deleted '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__f33065c9-eb42-44a7-9013-25bafc306bd5'
> ...
> Sep 07 18:41:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 18:41:08.195329 16051 slave.cpp:5920] Current disk usage 90.85%. Max allowed age: 0ns
> Sep 07 18:41:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 18:41:08.195503 16051 gc.cpp:218] Pruning directories with remaining removal time 1.99028708946667days
> ...
> Sep 07 18:49:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 18:49:01.253906 16049 gc.cpp:178] Deleted '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__014b451a-30de-41ee-b0b1-3733c790382c/runs/c5b922e8-eee0-4793-8637-7abbd7f8507e'
> ...
> Sep 07 19:08:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 19:08:01.291092 16048 slave.cpp:5920] Current disk usage 91.39%. Max allowed age: 0ns
> Sep 07 19:08:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 19:08:01.291285 16048 gc.cpp:218] Pruning directories with remaining removal time 1.99028598086815days
> ...
> Sep 07 19:14:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
W0907 19:14:50.737226 16050 gc.cpp:174] Failed to delete '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__4139bf2e-e33b-4743-8527-f8f50ac49280/runs/b1991e28-7ff8-476f-8122-1a483e431ff2':
No such file or directory
> ...
> Sep 07 19:33:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 19:33:50.758191 16052 slave.cpp:5920] Current disk usage 91.39%. Max allowed age: 0ns
> Sep 07 19:33:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 19:33:50.758872 16047 gc.cpp:218] Pruning directories with remaining removal time 1.99028057238519days
> ...
> Sep 07 19:39:43 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 19:39:43.081485 16052 gc.cpp:178] Deleted '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__d89dce1f-609b-4cf8-957a-5ba198be7828'
> ...
> Sep 07 19:59:43 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 19:59:43.150535 16048 slave.cpp:5920] Current disk usage 94.56%. Max allowed age: 0ns
> Sep 07 19:59:43 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 19:59:43.150869 16054 gc.cpp:218] Pruning directories with remaining removal time 1.98959316198222days
> ...
> Sep 07 20:06:16 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]:
I0907 20:06:16.251552 16051 gc.cpp:178] Deleted '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0259/executors/data__45283e7d-9a5e-4d4b-9901-b7f1e096cd54/runs/5cfc5e3e-3975-41aa-846b-c125eb529fbe'
> {noformat}
> Each GC activity took 5+ minutes. During the period, the agent became unresponsive, the
health check timed out, and no endpoint responded as well. When a disk-usage GC is trigged,
around 300 pruning actors would be generated (https://github.com/apache/mesos/blob/master/src/slave/gc.cpp#L229).
My hypothesis is that these actors would used all of the worker threads, and some of them
took a long time to finish (possibly due to many files to delete, or too many fs operations
at once, etc).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message