mesos-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benjamin Mahler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-461) Freezer invoked multiple times for the same cgroup.
Date Sat, 04 May 2013 23:00:16 GMT

    [ https://issues.apache.org/jira/browse/MESOS-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649199#comment-13649199
] 

Benjamin Mahler commented on MESOS-461:
---------------------------------------

For a fix here, I'd like to improve the logging and increase the amount of time we allow for
the cgroup to become empty.

I'd also like to see whether this comment is indeed true in the Freezer when the cgroup is
in the FREEZING state, as this was the source of the failure:

      // We don't need to worry about the race condition here as it is not
      // possible to add new processes into this cgroup or remove processes from
      // this cgroup when freezer.state is in FREEZING state.
      foreach (pid_t pid, pids.get()) {
        Try<proc::ProcessStatistics> stat = proc::stat(pid);
        if (stat.isError()) {
          promise.fail("Failed to get process statistics: " + stat.error());
          terminate(self());
          return;
        }
                
> Freezer invoked multiple times for the same cgroup.
> ---------------------------------------------------
>
>                 Key: MESOS-461
>                 URL: https://issues.apache.org/jira/browse/MESOS-461
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>
> I kept the logs relevant to a second executor 'gc', in case they are relevant.
> An OOM is triggered for the executor, after which we freeze it successfully.
> I0504 20:48:44.481727 44475 cgroups_isolation_module.cpp:819] OOM notifier is triggered
for executor E1 of framework F with tag T
> I0504 20:48:44.571171 44475 cgroups_isolation_module.cpp:863] OOM detected for executor
E1 of framework F with tag T
> I0504 20:48:44.573693 44475 cgroups_isolation_module.cpp:889] MEMORY LIMIT: 1707081728
bytes
> MEMORY USAGE: 1707081728 bytes
> MEMORY STATISTICS:
> cache 229376
> rss 1706852352
> mapped_file 16384
> pgpgin 4722471
> pgpgout 4305703
> inactive_anon 0
> active_anon 1706713088
> inactive_file 102400
> active_file 53248
> unevictable 0
> hierarchical_memory_limit 1707081728
> total_cache 229376
> total_rss 1706852352
> total_mapped_file 16384
> total_pgpgin 4722471
> total_pgpgout 4305703
> total_inactive_anon 0
> total_active_anon 1706713088
> total_inactive_file 102400
> total_active_file 53248
> total_unevictable 0
> I0504 20:48:44.573789 44475 cgroups_isolation_module.cpp:534] Killing executor E1 of
framework F
> I0504 20:48:44.578125 44471 cgroups.cpp:1146] Trying to freeze cgroup /cgroup/mesos/F_executor_E1_tag_UUID
> I0504 20:48:46.009472 44464 slave.cpp:830] Status update: task system-gc-30cd1abb-84cf-4a12-8c1b-13396a27867f
of framework F is now in state TASK_FINISHED
> I0504 20:48:46.014714 44464 slave.cpp:727] Got acknowledgement of status update for task
system-gc-30cd1abb-84cf-4a12-8c1b-13396a27867f of framework F
> I0504 20:48:46.015807 44470 cgroups_isolation_module.cpp:571] Changing cgroup controls
for executor gc-30cd1abb-84cf-4a12-8c1b-13396a27867f of framework F with resources cpus=0.19;
disk=15; mem=127
> I0504 20:48:46.024636 44470 cgroups_isolation_module.cpp:676] Updated 'cpu.shares' to
194 for executor gc of framework F
> I0504 20:48:46.043144 44470 cgroups_isolation_module.cpp:774] Updated 'memory.soft_limit_in_bytes'
to 133169152 for executor gc of framework F
> I0504 20:48:51.829247 44472 cgroups_isolation_module.cpp:633] Telling slave of terminated
executor gc of framework F
> I0504 20:48:51.830458 44472 cgroups_isolation_module.cpp:534] Killing executor gc of
framework F
> I0504 20:48:51.830574 44462 slave.cpp:1053] Executor 'gc' of framework F has exited with
status 0
> I0504 20:48:51.846076 44472 cgroups_isolation_module.cpp:819] OOM notifier is triggered
for executor gc of framework F with tag T
> I0504 20:48:51.846158 44472 cgroups_isolation_module.cpp:824] Discarded OOM notifier
for executor gc of framework F with tag T
> I0504 20:48:51.859122 44472 cgroups.cpp:1146] Trying to freeze cgroup /cgroup/mesos/framework_F_executor_gc_tag_T
> I0504 20:48:51.859297 44472 cgroups.cpp:1185] Successfully froze cgroup /cgroup/mesos/framework_F_executor_gc_tag_T
after 1 attempts
> I0504 20:48:51.893443 44473 cgroups.cpp:1161] Trying to thaw cgroup /cgroup/mesos/framework_F_executor_gc_tag_T
> I0504 20:48:51.893656 44473 cgroups.cpp:1268] Successfully thawed /cgroup/mesos/framework_F_executor_gc_tag_T
> I0504 20:48:51.946413 44473 cgroups_isolation_module.cpp:903] Successfully destroyed
the cgroup mesos/framework_F_executor_gc_tag_T
> I0504 20:49:05.155886 44463 cgroups.cpp:1185] Successfully froze cgroup /cgroup/mesos/framework_F_executor_E1_tag_UUID
after 5 attempts
> I0504 20:49:05.181869 44472 cgroups.cpp:1161] Trying to thaw cgroup /cgroup/mesos/framework_F_executor_E1_tag_UUID
> I0504 20:49:05.195574 44472 cgroups.cpp:1268] Successfully thawed /cgroup/mesos/framework_F_executor_E1_tag_UUID
> At this point, the cgroup is frozen and thawed, and we then to freeze the cgroup again!
> I0504 20:49:11.364585 44468 cgroups.cpp:1146] Trying to freeze cgroup /cgroup/mesos/framework_F_executor_E1_tag_UUID
> F0504 20:49:11.687748 44468 cgroups_isolation_module.cpp:905] Failed to destroy the cgroup
mesos/framework_F_executor_E1_tag_UUID: Failed to kill tasks in nested cgroups: Collect failed:
Failed to get process statistics: Failed to open '/proc/33071/stat'
> *** Check failure stack trace: ***
>     @     0x7fb4c489169d  google::LogMessage::Fail()
>     @     0x7fb4c4897307  google::LogMessage::SendToLog()
>     @     0x7fb4c4892f4c  google::LogMessage::Flush()
>     @     0x7fb4c48931b6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fb4c46431f0  mesos::internal::slave::CgroupsIsolationModule::destroyWaited()
>     @     0x7fb4c4653f77  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7fb4c4656c24  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7fb4c478cd7b  std::tr1::function<>::operator()()
>     @     0x7fb4c4745f9f  process::ProcessBase::visit()
>     @     0x7fb4c475aec8  process::DispatchEvent::visit()
>     @     0x7fb4c474f11d  process::ProcessManager::resume()
>     @     0x7fb4c474f968  process::schedule()
>     @     0x7fb4c3e6873d  start_thread
>     @     0x7fb4c284cf6d  clone

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message