mesos-dev mailing list archives

From Brenden Matthews <brenden.matth...@airbedandbreakfast.com>
Subject Re: OOM killer stuck when freezing cgroup
Date Fri, 10 May 2013 17:44:25 GMT
This isn't a kernel OOM.  I started the debugger and let it go overnight
with no luck.  It just doesn't want to budge.  I'm aware of the memory
limit issue, but I don't think that's the real problem here.  Regardless of
whether we run out of memory, the whole system shouldn't freeze when trying
to kill the process.  I'm not sure if this is a kernel problem or a Mesos
problem, but I suspect that the issue isn't with the kernel.  Walking
through the Mesos code, I can't see anything that is obviously wrong
either.  I'm using my own branch from here:
https://github.com/airbnb/mesos/tree/testing.
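
On the memory limit point: it's easy enough to sanity-check by hand whether
the TaskTracker heap actually fits under the cgroup limit.  A rough sketch,
using the executor cgroup path from the slave log quoted below:

CG=/cgroup/mesos/framework_201305100055-1471680778-5050-18199-0000_executor_executor_Task_Tracker_446_tag_14a21d5a-a670-40fd-91b1-54c42b9a30e0
cat $CG/memory.limit_in_bytes       # hard limit the OOM notifier fires on
cat $CG/memory.max_usage_in_bytes   # high-water mark for the group
ps -o pid,args -C java | grep -o -e '-Xmx[^ ]*'   # heap the TaskTracker was started with

The Requested/Used numbers in the log below match, so the OOM itself looks
expected; it's the freeze afterwards that doesn't.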

My next step will be to run a bunch of slaves in GDB and then launch some
tasks.  Hopefully I can stop them and get a backtrace.  I thought it might
be worth asking about it here to see if anyone else has dealt with this and
knows what's going on.
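
For anyone who wants to try the same, the plan is roughly this (assuming the
slave binary is the lt-mesos-slave libtool target, the same name used in the
killall below):

# attach to a running slave and dump every thread's backtrace
gdb -batch -p $(pgrep -f lt-mesos-slave | head -1) \
    -ex 'set pagination off' \
    -ex 'thread apply all bt'

Since attaching has been hanging for me, starting the slaves under gdb in the
first place (gdb --args) is probably the safer bet.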

These are EC2 instances running the latest Ubuntu Precise AMIs.  Here's the
`uname -a' output from one:

Linux i-b5dbc1d5 3.2.0-41-virtual #66-Ubuntu SMP Thu Apr 25 03:47:17 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux
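
For completeness, this is what I look at on a wedged box (a sketch, with CG
set to the same executor cgroup path as above):

cat $CG/freezer.state   # sits at FREEZING when this happens
cat $CG/tasks           # pids the freezer is still waiting on
# writing THAWED back is the usual way to un-stick a half-frozen group:
echo THAWED > $CG/freezer.state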




On Fri, May 10, 2013 at 10:21 AM, Benjamin Mahler
<benjamin.mahler@gmail.com> wrote:

> Hey Brenden, what kernel version are you running? What version of the code
> are you running? Are you seeing this on different machines? Can you confirm
> you're unable to attach to the slave with gdb (that would be the easiest
> way to figure out what's blocking in the slave)? Is that always the last
> line in the log before the processes get stuck?
>
> As an aside, it looks like we aren't properly setting the JVM heap size for
> TaskTrackers in the MesosScheduler for Hadoop. We try to estimate the JVM
> memory overhead, but the estimate is perhaps not enough to avoid kernel OOMs.
>
>
> On Fri, May 10, 2013 at 9:57 AM, Brenden Matthews
> <brenden.matthews@airbedandbreakfast.com> wrote:
>
> > Hey folks,
> >
> > I'm bumping into a problem frequently.  When an executor exceeds its
> > memory limit and needs to be killed, the entire system becomes stuck
> > (though I'm unsure whether it's stuck before or after attempting to
> > freeze).  Here's what the mesos slave log looks like:
> >
> > I0510 04:22:37.935608 13622 cgroups_isolator.cpp:1023] OOM notifier is
> > > triggered for executor executor_Task_Tracker_446 of framework
> > > 201305100055-1471680778-5050-18199-0000 with uuid
> > > 14a21d5a-a670-40fd-91b1-54c42b9a30e0
> > > I0510 04:22:37.935925 13622 cgroups_isolator.cpp:1068] OOM detected for
> > > executor executor_Task_Tracker_446 of framework
> > > 201305100055-1471680778-5050-18199-0000 with uuid
> > > 14a21d5a-a670-40fd-91b1-54c42b9a30e0
> > > I0510 04:22:37.936982 13622 cgroups_isolator.cpp:1109] Memory limit
> > > exceeded: Requested: 2570MB Used: 2570MB
> > > MEMORY STATISTICS:
> > > cache 1682083840
> > > rss 1012756480
> > > mapped_file 1085440
> > > pgpgin 22863635
> > > pgpgout 22205715
> > > swap 0
> > > pgfault 11881912
> > > pgmajfault 304
> > > inactive_anon 0
> > > active_anon 1012756480
> > > inactive_file 1065136128
> > > active_file 616947712
> > > unevictable 0
> > > hierarchical_memory_limit 2694840320
> > > hierarchical_memsw_limit 9223372036854775807
> > > total_cache 1682083840
> > > total_rss 1012756480
> > > total_mapped_file 1085440
> > > total_pgpgin 22863635
> > > total_pgpgout 22205715
> > > total_swap 0
> > > total_pgfault 11881912
> > > total_pgmajfault 304
> > > total_inactive_anon 0
> > > total_active_anon 1012756480
> > > total_inactive_file 1065136128
> > > total_active_file 616947712
> > > total_unevictable 0
> > > I0510 04:22:37.965666 13622 cgroups_isolator.cpp:620] Killing executor
> > > executor_Task_Tracker_446 of framework
> > > 201305100055-1471680778-5050-18199-0000
> > > I0510 04:22:37.967418 13621 cgroups.cpp:1175] Trying to freeze cgroup
> > > /cgroup/mesos/framework_201305100055-1471680778-5050-18199-0000_executor_executor_Task_Tracker_446_tag_14a21d5a-a670-40fd-91b1-54c42b9a30e0
> >
> > After the last line in the log, both the executing task and the mesos
> > process are stuck indefinitely.  Usually the way to resolve this is
> > `killall -9 java lt-mesos-slave' or a reboot.
> >
> > If I cat /cgroup/mesos/framework.../freezer.state, it shows that it's
> > freezing.  Since the entire system is stuck, I cannot attach a debugger
> > to either the JVM or mesos.
> >
> > Any ideas?  This problem is very frustrating and I'm stuck at this point.
> >
>
