mesos-issues mailing list archives

From "Gilbert Song (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-7939) Early disk usage check for garbage collection during recovery
Date Sat, 20 Jan 2018 01:13:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333088#comment-16333088 ]

Gilbert Song commented on MESOS-7939:
-------------------------------------

I am downgrading this issue to `Major` priority, since we do not have a good solution for now, and the potential solution would still require operators to manually restart the agent. As a workaround, operators can clean up the sandboxes themselves.
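
For illustration, here is a minimal C++17 sketch of that workaround, under the assumption that "cleaning up the sandboxes" means deleting stale run directories under the agent's work directory. The work directory path, the `executors/<id>/runs/<container-id>` layout, and the 24-hour staleness threshold are assumptions drawn from the log paths quoted below, not something this ticket prescribes, and the sketch deliberately leaves the checkpointed state under `meta/` alone. Operators should verify which runs are actually finished before deleting anything.

{noformat}
#include <chrono>
#include <filesystem>
#include <iostream>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

int main()
{
  // Hypothetical agent work directory; matches the sandbox paths in the
  // logs quoted below. Adjust to the real --work_dir before running.
  const fs::path slavesDir = "/var/lib/mesos/slave/slaves";

  // Treat run directories untouched for more than 24 hours as stale.
  // This is a heuristic, not a guarantee that the run has finished.
  const auto cutoff =
      fs::file_time_type::clock::now() - std::chrono::hours(24);

  std::vector<fs::path> stale;
  std::error_code ec;

  for (auto it = fs::recursive_directory_iterator(slavesDir, ec);
       it != fs::recursive_directory_iterator();
       it.increment(ec)) {
    // Sandbox runs live under .../executors/<id>/runs/<container-id>.
    if (it->is_directory(ec) &&
        it->path().parent_path().filename() == "runs" &&
        fs::last_write_time(it->path(), ec) < cutoff) {
      stale.push_back(it->path());
      it.disable_recursion_pending();  // No need to descend further.
    }
  }

  for (const auto& path : stale) {
    std::cout << "Removing " << path << std::endl;
    fs::remove_all(path, ec);
  }

  return 0;
}
{noformat}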

> Early disk usage check for garbage collection during recovery
> -------------------------------------------------------------
>
>                 Key: MESOS-7939
>                 URL: https://issues.apache.org/jira/browse/MESOS-7939
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Major
>
> Currently the default value for `disk_watch_interval` is 1 minute. This is not fast enough and could lead to the following scenario:
> 1. The disk usage was checked and there was not enough headroom:
> {noformat}
> I0901 17:54:33.000000 25510 slave.cpp:5896] Current disk usage 99.87%. Max allowed age: 0ns
> {noformat}
> But no container was pruned because no container had been scheduled for GC.
> 2. A task was completed. The task itself contained a lot of nested containers, each of which used a lot of disk space. Note that there is no way for the Mesos agent to schedule individual nested containers for GC, since nested containers are not necessarily tied to tasks. When the top-level container completed, it was scheduled for GC, and the nested containers would be GC'ed as well:
> {noformat}
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc' for gc 1.99999466483852days in the future
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e' for gc 1.99999466405037days in the future
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc' for gc 1.9999946635763days in the future
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e' for gc 1.99999466324148days in the future
> {noformat}
> 3. Since the next disk usage check was still about 40 seconds away, no GC was performed even though the disk was full. As a result, the Mesos agent failed to checkpoint the task status:
> {noformat}
> I0901 17:54:49.000000 25513 status_update_manager.cpp:323] Received status update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> F0901 17:54:49.000000 25513 slave.cpp:4748] CHECK_READY(future): is FAILED: Failed to open '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates' for status updates: No space left on device Failed to handle status update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> {noformat}
> 4. When the agent restarted, it tried to checkpoint the task status again. However, since the first disk usage check was scheduled 1 minute after startup, the agent failed before GC kicked in, falling into a restart failure loop:
> {noformat}
> F0901 17:55:06.000000 31114 slave.cpp:4748] CHECK_READY(future): is FAILED: Failed to open '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates' for status updates: No space left on device Failed to handle status update TASK_FAILED (UUID: fb9c3951-9a93-4925-a7f0-9ba7e38d2398) for task node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> {noformat}
> We should kick off GC earlier so that the agent can recover from this state (see the sketch below).
> Related ticket: MESOS-7031
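
The last line of the description points at the shape of the fix: check disk usage as part of agent recovery instead of waiting up to `disk_watch_interval` for the first periodic check. Below is a rough, standalone C++ sketch of that idea; the function names, the pruning hook, and the way the maximum allowed sandbox age is scaled down with disk usage are illustrative assumptions, not the actual Mesos implementation.

{noformat}
#include <sys/statvfs.h>

#include <algorithm>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Fraction of the filesystem backing `workDir` that is in use, in [0, 1].
double diskUsage(const std::string& workDir)
{
  struct statvfs buf;
  if (::statvfs(workDir.c_str(), &buf) != 0) {
    throw std::runtime_error("statvfs failed on " + workDir);
  }
  const double total = static_cast<double>(buf.f_blocks);
  const double available = static_cast<double>(buf.f_bavail);
  return total > 0.0 ? 1.0 - available / total : 0.0;
}

// Early check, run once during agent recovery before any task status is
// checkpointed. `prune(maxAgeSecs)` stands in for asking the garbage
// collector to delete every already-scheduled path older than `maxAgeSecs`.
void checkDiskUsageOnRecovery(
    const std::string& workDir,
    double gcDelaySecs,     // upper bound on how long sandboxes are kept
    double gcDiskHeadroom,  // e.g. 0.1, i.e. start pruning above 90% usage
    const std::function<void(double)>& prune)
{
  const double usage = diskUsage(workDir);

  // Scale the maximum allowed sandbox age down as the disk fills up; at
  // 99.87% usage this collapses to zero, which is consistent with the
  // "Max allowed age: 0ns" log line in step 1 above.
  const double maxAgeSecs =
      gcDelaySecs * std::max(0.0, 1.0 - gcDiskHeadroom - usage);

  std::cout << "Current disk usage " << usage * 100.0
            << "%. Max allowed age: " << maxAgeSecs << "s" << std::endl;

  prune(maxAgeSecs);
}

int main()
{
  checkDiskUsageOnRecovery(
      "/var/lib/mesos/slave",
      2 * 24 * 3600.0,  // roughly the two-day GC delays seen in the logs
      0.1,
      [](double maxAgeSecs) {
        // Placeholder: a real agent would delete GC-scheduled sandboxes here.
        std::cout << "Would prune paths older than " << maxAgeSecs << "s"
                  << std::endl;
      });

  return 0;
}
{noformat}

With a full disk the computed maximum allowed age collapses to zero, so pruning would run immediately on recovery and the agent would get a chance to checkpoint the pending status update instead of crashing in a loop.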



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
