mesos-dev mailing list archives

From Benjamin Mahler <benjamin.mah...@gmail.com>
Subject Re: OOM killer stuck when freezing cgroup
Date Fri, 10 May 2013 17:21:10 GMT
Hey Brenden, what kernel version are you running, and what version of the
Mesos code? Are you seeing this on different machines? Can you confirm
you're unable to attach to the slave with gdb (that would be the easiest
way to figure out what's blocking in the slave)? Is that always the last
line in the log before the processes get stuck?

As an aside, it looks like we aren't properly setting the JVM heap size for
TaskTrackers in the MesosScheduler for Hadoop. We try to estimate the JVM
memory overhead, but the estimate may not leave enough headroom to avoid
kernel OOMs.
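
Roughly what I'd expect that sizing to look like, as a sketch only -- the
class name, field names, and overhead numbers below are made up for
illustration and are not the actual MesosScheduler code:

    // Sketch of sizing the TaskTracker JVM against the Mesos memory resource.
    // Every name and number here is illustrative, not real Hadoop-on-Mesos code.
    public class TaskTrackerHeapSizing {
      // Fraction of the cgroup limit reserved for usage outside the Java heap
      // that still counts against the limit (permgen, thread stacks, native
      // buffers, page cache, etc.).
      private static final double OVERHEAD_FRACTION = 0.25;
      private static final long MIN_OVERHEAD_MB = 384;

      // Returns the -Xmx value (in MB) to pass to the TaskTracker JVM.
      static long heapMbFor(long mesosMemMb) {
        long overheadMb =
            Math.max(MIN_OVERHEAD_MB, (long) (mesosMemMb * OVERHEAD_FRACTION));
        return mesosMemMb - overheadMb;
      }

      public static void main(String[] args) {
        long mesosMemMb = 2570;  // the limit reported in the slave log below
        System.out.println("-Xmx" + heapMbFor(mesosMemMb) + "m");
      }
    }

Worth noting from your log that cache (~1.6 GB) plus rss (~1.0 GB) is what
hit the 2570 MB limit, so whatever headroom we leave has to cover more than
just the Java heap.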


On Fri, May 10, 2013 at 9:57 AM, Brenden Matthews <brenden.matthews@airbedandbreakfast.com> wrote:

> Hey folks,
>
> I'm bumping into a problem frequently.  When an executor exceeds its
> memory limit and needs to be killed, the entire system becomes stuck
> (though I'm unsure whether it gets stuck before or after the freeze
> attempt).  Here's what the mesos slave log looks like:
>
> I0510 04:22:37.935608 13622 cgroups_isolator.cpp:1023] OOM notifier is
> > triggered for executor executor_Task_Tracker_446 of framework
> > 201305100055-1471680778-5050-18199-0000 with uuid
> > 14a21d5a-a670-40fd-91b1-54c42b9a30e0
> > I0510 04:22:37.935925 13622 cgroups_isolator.cpp:1068] OOM detected for
> > executor executor_Task_Tracker_446 of framework
> > 201305100055-1471680778-5050-18199-0000 with uuid
> > 14a21d5a-a670-40fd-91b1-54c42b9a30e0
> > I0510 04:22:37.936982 13622 cgroups_isolator.cpp:1109] Memory limit
> > exceeded: Requested: 2570MB Used: 2570MB
> > MEMORY STATISTICS:
> > cache 1682083840
> > rss 1012756480
> > mapped_file 1085440
> > pgpgin 22863635
> > pgpgout 22205715
> > swap 0
> > pgfault 11881912
> > pgmajfault 304
> > inactive_anon 0
> > active_anon 1012756480
> > inactive_file 1065136128
> > active_file 616947712
> > unevictable 0
> > hierarchical_memory_limit 2694840320
> > hierarchical_memsw_limit 9223372036854775807
> > total_cache 1682083840
> > total_rss 1012756480
> > total_mapped_file 1085440
> > total_pgpgin 22863635
> > total_pgpgout 22205715
> > total_swap 0
> > total_pgfault 11881912
> > total_pgmajfault 304
> > total_inactive_anon 0
> > total_active_anon 1012756480
> > total_inactive_file 1065136128
> > total_active_file 616947712
> > total_unevictable 0
> > I0510 04:22:37.965666 13622 cgroups_isolator.cpp:620] Killing executor
> > executor_Task_Tracker_446 of framework
> > 201305100055-1471680778-5050-18199-0000
> > I0510 04:22:37.967418 13621 cgroups.cpp:1175] Trying to freeze cgroup
> > /cgroup/mesos/framework_201305100055-1471680778-5050-18199-0000_executor_executor_Task_Tracker_446_tag_14a21d5a-a670-40fd-91b1-54c42b9a30e0
>
> After the last line in the log, both the executing task and the mesos
> slave process are stuck indefinitely.  Usually the only way to resolve
> this is `killall -9 java lt-mesos-slave` or a reboot.
>
> If I cat /cgroup/mesos/framework.../freezer.state, it shows that the
> cgroup is still FREEZING.  Since the entire system is stuck, I cannot
> attach a debugger to either the JVM or mesos.
>
> Any ideas?  This problem is very frustrating and I'm stuck at this point.
>
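
For context on where this hangs: the "Trying to freeze cgroup" step boils
down to writing FROZEN into the cgroup's freezer.state and then polling that
file until the kernel reports FROZEN. A rough sketch of that protocol (in
Java purely for illustration -- this is not our actual cgroups.cpp code, and
the cgroup path is taken as an argument rather than hard-coded):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;

    // Illustration of the cgroup freezer protocol, not Mesos's C++ code.
    // Usage: java FreezerSketch /cgroup/mesos/framework_..._executor_..._tag_...
    public class FreezerSketch {
      public static void main(String[] args) throws Exception {
        Path state = Paths.get(args[0], "freezer.state");

        // Ask the kernel to freeze every task in the cgroup.
        Files.write(state, "FROZEN\n".getBytes(StandardCharsets.UTF_8));

        // The file reads FREEZING until all tasks are frozen.  If a task is
        // wedged in the kernel (e.g. mid-OOM), it can stay FREEZING forever,
        // which matches what you see when you cat freezer.state.
        while (true) {
          String s = new String(Files.readAllBytes(state),
                                StandardCharsets.UTF_8).trim();
          System.out.println("freezer.state = " + s);
          if (s.equals("FROZEN")) {
            break;
          }
          Thread.sleep(100);
        }
      }
    }

If the OOM'd task never finishes freezing, the destroy path in the slave
blocks at that point, which would line up with that being the last line in
your log.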
