Date: Sat, 4 May 2013 22:56:15 +0000 (UTC)
From: "Benjamin Mahler (JIRA)"
To: mesos-dev@incubator.apache.org
Subject: [jira] [Comment Edited] (MESOS-461) Freezer invoked multiple times for the same cgroup.

    [ https://issues.apache.org/jira/browse/MESOS-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649198#comment-13649198 ]

Benjamin Mahler edited comment on MESOS-461 at 5/4/13 10:55 PM:
----------------------------------------------------------------

I suspect this occurs because in the TasksKiller, after the freeze -> kill -> thaw chain, the cgroup was not empty. If that happens, the freezing will be attempted again. I wonder whether a delay of more than 5 seconds is possible before the cgroup becomes empty. We don't have logging when the kill chain is attempted again, but I think that is what happened here.

20:48:44: freeze() attempt #1
20:49:05: frozen after 5 attempts
20:49:05: thaw() attempt
20:49:05: thawed
20:49:11: freeze() attempt #2

Interestingly, it took ~20 seconds to freeze the cgroup, while the expected delay is ~0.1 seconds * 5 attempts = 0.5 seconds. It took ~5 seconds for the second freeze attempt to occur, which is the _expected_ delay when an empty() check fails for the cgroup. The EmptyWatcher will try 50 times at a 0.1 second interval, taking 5 seconds total at a minimum. I suspect the cgroup did not empty out in this amount of time and the TasksKiller therefore retried, i.e.:

20:48:44: freeze() attempt #1
20:49:05: frozen after 5 attempts
20:49:05: thaw() attempt
20:49:05: thawed
--> 20:49:05: empty()
--> 20:49:10: empty() completes; after 50 attempts the cgroup was not empty
20:49:11: freeze() attempt #2


was (Author: bmahler):
I suspect this occurs because after the freeze -> kill -> thaw chain, the cgroup was not empty. If that happens, the freezing will be attempted again. I wonder whether there can be a delay before cgroup.tasks becomes empty. We don't have logging when the chain is attempted again, but I think that is what happened here.

20:48:44: freeze() attempt #1
20:49:05: frozen after 5 attempts
20:49:05: thaw() attempt
20:49:05: thawed
20:49:11: freeze() attempt #2

Interestingly, it took ~20 seconds to freeze the cgroup, while the expected delay is ~0.1 seconds * 5 attempts = 0.5 seconds. It took ~5 seconds for the second freeze attempt to occur, which is the _expected_ delay when an empty() check fails for the cgroup. The EmptyWatcher will try 50 times at 0.1 second intervals = 5 seconds. I suspect the cgroup did not empty out in this amount of time and the TasksKiller therefore retried, i.e.:

20:48:44: freeze() attempt #1
20:49:05: frozen after 5 attempts
20:49:05: thaw() attempt
20:49:05: thawed
--> 20:49:05: empty()
--> 20:49:10: empty() completes; after 50 attempts the cgroup was not empty
20:49:11: freeze() attempt #2
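
[Editor's note] A minimal standalone sketch of the retry pattern described in the comment above, for readers not familiar with the code. This is not the Mesos implementation: the TasksKiller and EmptyWatcher named in the comment are the real components, while the helpers below (tasksEmpty, waitForEmpty, destroyCgroup) are hypothetical names that only mirror the arithmetic the comment gives -- an emptiness poll of 50 checks at 0.1 second intervals (~5 seconds before giving up), after which the freeze -> kill -> thaw pass is attempted again, producing a second "Trying to freeze" line for the same cgroup.

    // Sketch only: helper names are hypothetical, not the Mesos API.
    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <thread>

    // True if the cgroup's 'tasks' file lists no pids (error handling elided).
    bool tasksEmpty(const std::string& cgroup)
    {
      std::ifstream tasks(cgroup + "/tasks");
      std::string pid;
      return !(tasks >> pid);
    }

    // Bounded poll mirroring the EmptyWatcher arithmetic described above:
    // up to 50 checks at 0.1 second intervals, i.e. ~5 seconds on failure.
    bool waitForEmpty(
        const std::string& cgroup,
        int attempts = 50,
        std::chrono::milliseconds interval = std::chrono::milliseconds(100))
    {
      for (int i = 0; i < attempts; i++) {
        if (tasksEmpty(cgroup)) {
          return true;
        }
        std::this_thread::sleep_for(interval);
      }
      return false;  // Still not empty after ~5 seconds.
    }

    // Each iteration is one freeze -> kill -> thaw pass (the cgroup
    // operations themselves are elided); if the cgroup has not emptied
    // after the ~5 second poll, the loop repeats and the freezer is
    // invoked again for the same cgroup -- the pattern in the logs below.
    void destroyCgroup(const std::string& cgroup)
    {
      while (true) {
        std::cout << "Trying to freeze cgroup " << cgroup << std::endl;
        // freeze(cgroup), kill the tasks, thaw(cgroup): elided here.

        if (waitForEmpty(cgroup)) {
          std::cout << "Cgroup " << cgroup << " is empty; removing it" << std::endl;
          return;
        }
      }
    }

    int main()
    {
      // Hypothetical cgroup path, just to exercise the sketch.
      destroyCgroup("/cgroup/mesos/framework_F_executor_E1_tag_UUID");
      return 0;
    }

Note that the loop in this sketch retries indefinitely; whether the actual TasksKiller bounds its retries is not stated in the comment.
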
> Freezer invoked multiple times for the same cgroup.
> ---------------------------------------------------
>
>                 Key: MESOS-461
>                 URL: https://issues.apache.org/jira/browse/MESOS-461
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>
> I kept the log lines for a second executor, 'gc', in case they are relevant.
> An OOM is triggered for the executor, after which we freeze it successfully.
> I0504 20:48:44.481727 44475 cgroups_isolation_module.cpp:819] OOM notifier is triggered for executor E1 of framework F with tag T
> I0504 20:48:44.571171 44475 cgroups_isolation_module.cpp:863] OOM detected for executor E1 of framework F with tag T
> I0504 20:48:44.573693 44475 cgroups_isolation_module.cpp:889] MEMORY LIMIT: 1707081728 bytes
> MEMORY USAGE: 1707081728 bytes
> MEMORY STATISTICS:
> cache 229376
> rss 1706852352
> mapped_file 16384
> pgpgin 4722471
> pgpgout 4305703
> inactive_anon 0
> active_anon 1706713088
> inactive_file 102400
> active_file 53248
> unevictable 0
> hierarchical_memory_limit 1707081728
> total_cache 229376
> total_rss 1706852352
> total_mapped_file 16384
> total_pgpgin 4722471
> total_pgpgout 4305703
> total_inactive_anon 0
> total_active_anon 1706713088
> total_inactive_file 102400
> total_active_file 53248
> total_unevictable 0
> I0504 20:48:44.573789 44475 cgroups_isolation_module.cpp:534] Killing executor E1 of framework F
> I0504 20:48:44.578125 44471 cgroups.cpp:1146] Trying to freeze cgroup /cgroup/mesos/F_executor_E1_tag_UUID
> I0504 20:48:46.009472 44464 slave.cpp:830] Status update: task system-gc-30cd1abb-84cf-4a12-8c1b-13396a27867f of framework F is now in state TASK_FINISHED
> I0504 20:48:46.014714 44464 slave.cpp:727] Got acknowledgement of status update for task system-gc-30cd1abb-84cf-4a12-8c1b-13396a27867f of framework F
> I0504 20:48:46.015807 44470 cgroups_isolation_module.cpp:571] Changing cgroup controls for executor gc-30cd1abb-84cf-4a12-8c1b-13396a27867f of framework F with resources cpus=0.19; disk=15; mem=127
> I0504 20:48:46.024636 44470 cgroups_isolation_module.cpp:676] Updated 'cpu.shares' to 194 for executor gc of framework F
> I0504 20:48:46.043144 44470 cgroups_isolation_module.cpp:774] Updated 'memory.soft_limit_in_bytes' to 133169152 for executor gc of framework F
> I0504 20:48:51.829247 44472 cgroups_isolation_module.cpp:633] Telling slave of terminated executor gc of framework F
> I0504 20:48:51.830458 44472 cgroups_isolation_module.cpp:534] Killing executor gc of framework F
> I0504 20:48:51.830574 44462 slave.cpp:1053] Executor 'gc' of framework F has exited with status 0
> I0504 20:48:51.846076 44472 cgroups_isolation_module.cpp:819] OOM notifier is triggered for executor gc of framework F with tag T
> I0504 20:48:51.846158 44472 cgroups_isolation_module.cpp:824] Discarded OOM notifier for executor gc of framework F with tag T
> I0504 20:48:51.859122 44472 cgroups.cpp:1146] Trying to freeze cgroup /cgroup/mesos/framework_F_executor_gc_tag_T
> I0504 20:48:51.859297 44472 cgroups.cpp:1185] Successfully froze cgroup /cgroup/mesos/framework_F_executor_gc_tag_T after 1 attempts
> I0504 20:48:51.893443 44473 cgroups.cpp:1161] Trying to thaw cgroup /cgroup/mesos/framework_F_executor_gc_tag_T
> I0504 20:48:51.893656 44473 cgroups.cpp:1268] Successfully thawed /cgroup/mesos/framework_F_executor_gc_tag_T
> I0504 20:48:51.946413 44473 cgroups_isolation_module.cpp:903] Successfully destroyed the cgroup mesos/framework_F_executor_gc_tag_T
> I0504 20:49:05.155886 44463 cgroups.cpp:1185] Successfully froze cgroup /cgroup/mesos/framework_F_executor_E1_tag_UUID after 5 attempts
> I0504 20:49:05.181869 44472 cgroups.cpp:1161] Trying to thaw cgroup /cgroup/mesos/framework_F_executor_E1_tag_UUID
> I0504 20:49:05.195574 44472 cgroups.cpp:1268] Successfully thawed /cgroup/mesos/framework_F_executor_E1_tag_UUID
> At this point, the cgroup is frozen and thawed, and we then try to freeze the cgroup again!
> I0504 20:49:11.364585 44468 cgroups.cpp:1146] Trying to freeze cgroup /cgroup/mesos/framework_F_executor_E1_tag_UUID
> F0504 20:49:11.687748 44468 cgroups_isolation_module.cpp:905] Failed to destroy the cgroup mesos/framework_F_executor_E1_tag_UUID: Failed to kill tasks in nested cgroups: Collect failed: Failed to get process statistics: Failed to open '/proc/33071/stat'
> *** Check failure stack trace: ***
>     @     0x7fb4c489169d  google::LogMessage::Fail()
>     @     0x7fb4c4897307  google::LogMessage::SendToLog()
>     @     0x7fb4c4892f4c  google::LogMessage::Flush()
>     @     0x7fb4c48931b6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fb4c46431f0  mesos::internal::slave::CgroupsIsolationModule::destroyWaited()
>     @     0x7fb4c4653f77  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7fb4c4656c24  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7fb4c478cd7b  std::tr1::function<>::operator()()
>     @     0x7fb4c4745f9f  process::ProcessBase::visit()
>     @     0x7fb4c475aec8  process::DispatchEvent::visit()
>     @     0x7fb4c474f11d  process::ProcessManager::resume()
>     @     0x7fb4c474f968  process::schedule()
>     @     0x7fb4c3e6873d  start_thread
>     @     0x7fb4c284cf6d  clone

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira