mesos-issues mailing list archives

From "Bhuvan Arumugam (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-1758) Freezer failure leads to lost task during container destruction.
Date Tue, 16 Sep 2014 05:47:35 GMT

     [ https://issues.apache.org/jira/browse/MESOS-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhuvan Arumugam updated MESOS-1758:
-----------------------------------
    Fix Version/s:     (was: 0.21.0)
                   0.20.1

> Freezer failure leads to lost task during container destruction.
> ----------------------------------------------------------------
>
>                 Key: MESOS-1758
>                 URL: https://issues.apache.org/jira/browse/MESOS-1758
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Benjamin Mahler
>            Assignee: Vinod Kone
>             Fix For: 0.20.1
>
>
> In the past we've seen numerous issues around the freezer. Lately, on the 2.6.44 kernel, we've seen issues where we're unable to freeze the cgroup:
> (1) An OOM occurs.
> (2) No indication of the OOM appears in the kernel logs.
> (3) The slave is unable to freeze the cgroup.
> (4) The task is marked as lost.
> {noformat}
> I0903 16:46:24.956040 25469 mem.cpp:575] Memory limit exceeded: Requested: 15488MB Maximum Used: 15488MB
> MEMORY STATISTICS:
> cache 7958691840
> rss 8281653248
> mapped_file 9474048
> pgpgin 4487861
> pgpgout 522933
> pgfault 2533780
> pgmajfault 11
> inactive_anon 0
> active_anon 8281653248
> inactive_file 7631708160
> active_file 326852608
> unevictable 0
> hierarchical_memory_limit 16240345088
> total_cache 7958691840
> total_rss 8281653248
> total_mapped_file 9474048
> total_pgpgin 4487861
> total_pgpgout 522933
> total_pgfault 2533780
> total_pgmajfault 11
> total_inactive_anon 0
> total_active_anon 8281653248
> total_inactive_file 7631728640
> total_active_file 326852608
> total_unevictable 0
> I0903 16:46:24.956848 25469 containerizer.cpp:1041] Container bbb9732a-d600-4c1b-b326-846338c608c3 has reached its limit for resource mem(*):1.62403e+10 and will be terminated
> I0903 16:46:24.957427 25469 containerizer.cpp:909] Destroying container 'bbb9732a-d600-4c1b-b326-846338c608c3'
> I0903 16:46:24.958664 25481 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:46:34.959529 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:46:34.962070 25482 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.710848ms
> I0903 16:46:34.962658 25479 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:46:44.963349 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:46:44.965631 25472 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.588224ms
> I0903 16:46:44.966356 25472 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:46:54.967254 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:46:56.008447 25475 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 2.15296ms
> I0903 16:46:56.009071 25466 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:47:06.010329 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:47:06.012538 25467 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.643008ms
> I0903 16:47:06.013216 25467 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:47:12.516348 25480 slave.cpp:3030] Current usage 9.57%. Max allowed age: 5.630238827780799days
> I0903 16:47:16.015192 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:47:16.017043 25486 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.511168ms
> I0903 16:47:16.017555 25480 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:47:19.862746 25483 http.cpp:245] HTTP request for '/slave(1)/stats.json'
> E0903 16:47:24.960055 25472 slave.cpp:2557] Termination of executor 'E' of framework '201104070004-0000002563-0000' failed: Failed to destroy container: discarded future
> I0903 16:47:24.962054 25472 slave.cpp:2087] Handling status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000 from @0.0.0.0:0
> I0903 16:47:24.963470 25469 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 128MB for container bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:47:24.963541 25471 cpushare.cpp:338] Updated 'cpu.shares' to 256 (cpus 0.25) for container bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:47:24.964756 25471 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 25ms (cpus 0.25) for container bbb9732a-d600-4c1b-b326-846338c608c3
> I0903 16:47:43.406610 25476 status_update_manager.cpp:320] Received status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000
> I0903 16:47:43.406991 25476 status_update_manager.hpp:342] Checkpointing UPDATE for status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000
> I0903 16:47:43.410475 25476 status_update_manager.cpp:373] Forwarding status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000 to master@<scrubbed_ip>:5050
> I0903 16:47:43.439923 25480 status_update_manager.cpp:398] Received status update acknowledgement (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000
> I0903 16:47:43.440115 25480 status_update_manager.hpp:342] Checkpointing ACK for status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-0000002563-0000
> I0903 16:47:43.443595 25480 slave.cpp:2709] Cleaning up executor 'E' of framework 201104070004-0000002563-0000
> {noformat}
> We should consider avoiding the freezer entirely in favor of a kill(2) loop. We don't have to wait for pid namespaces to remove the freezer dependency.
> At the very least, when the freezer fails, we should proceed with a kill(2) loop to ensure that we destroy the cgroup.
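
For illustration only, a kill(2) loop of the kind proposed above might look roughly like the following Python sketch. It is not the Mesos implementation (Mesos is written in C++), and the names read_pids and kill_loop, the max_rounds and pause parameters, and the injectable kill argument are all hypothetical. It assumes the cgroup exposes its member PIDs one per line in a cgroup.procs file. Because nothing is frozen, a process may fork between the read and the kill, so the loop re-reads the file each round until it is empty or the round budget runs out.

```python
import os
import signal
import time


def read_pids(procs_path):
    """Return the PIDs listed in a cgroup.procs-style file (one per line)."""
    with open(procs_path) as f:
        return [int(line) for line in f if line.strip()]


def kill_loop(procs_path, kill=os.kill, max_rounds=50, pause=0.1):
    """Repeatedly SIGKILL every PID in the cgroup until none remain.

    Hypothetical helper: returns True once the PID list is empty,
    or False if max_rounds passes without emptying it.
    """
    for _ in range(max_rounds):
        pids = read_pids(procs_path)
        if not pids:
            return True
        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process exited between the read and the kill
        # Give the kernel a moment to reap processes before re-reading.
        time.sleep(pause)
    return False
```

Under these assumptions, a caller would pass the cgroup.procs path under the container's cgroup directory (the directory named in the "Freezing cgroup" log lines above) and treat a False return as "the cgroup could not be emptied", rather than waiting on the freezer.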



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
