aurora-dev mailing list archives

From Hussein Elgridly <huss...@broadinstitute.org>
Subject Nuked sandbox directory and failing to run finalizers on disk exceeded
Date Mon, 06 Apr 2015 17:30:21 GMT
Hi folks,

I've just had my first task fail due to exceeding disk capacity, and I've
run into some strange behaviour.

It's a Java process that's running inside a Docker container specified in
the task config. The Java process is failing with java.io.IOException: No
space left on device when attempting to write a file.

Three things are (or aren't) then happening which I think are just plain
wrong:

1. The task is being marked as failed (good!) but isn't reporting that it
exceeded disk limits (bad). I was expecting to see the "Disk limit
exceeded.  Reserved X bytes vs used Y bytes." message, but neither the
Mesos nor Aurora web interfaces are telling me this.
2. The task's sandbox directory is being nuked. All of it, immediately.
It's there while the job is running and vanishes as soon as the job fails
(I happened to be watching it live). This makes debugging difficult, and
the Aurora/Thermos web UI clearly has trouble too, because it reports the
resource requests as all zero when they most definitely weren't.
3. Finalizers aren't running. No finalizers = no error log = no debugging =
sadface. :(

I think what's actually happening here is that the process is running out
of disk on the machine itself and that IOException is propagating up from
the kernel, rather than Mesos killing the process from its disk usage
monitoring.
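
To make that distinction concrete, here is a rough Python sketch of the
check I have in mind. It is not anything Mesos or Aurora ships: the function
names, the 1 MiB "effectively full" threshold, and the idea of comparing
sandbox usage against the reservation by hand are all my own assumptions.

```python
import os
import shutil


def sandbox_usage_bytes(sandbox_dir):
    """Sum the sizes of all regular files under sandbox_dir (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(sandbox_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # File vanished between listing and stat; ignore it.
                pass
    return total


def likely_cause(used_bytes, reserved_bytes, fs_free_bytes,
                 min_free_bytes=1024 * 1024):
    """Guess why a write failed. The threshold is an arbitrary assumption."""
    if used_bytes > reserved_bytes:
        # The task is over its reservation, so a disk monitor could have acted.
        return "over reservation"
    if fs_free_bytes < min_free_bytes:
        # The task is within its reservation but the filesystem itself is full,
        # so the error most likely came straight from the kernel (ENOSPC).
        return "filesystem full"
    return "unclear"


if __name__ == "__main__":
    # The current directory stands in for a real sandbox path here.
    sandbox = os.getcwd()
    used = sandbox_usage_bytes(sandbox)
    free = shutil.disk_usage(sandbox).free
    # The 10 GiB reservation is a made-up figure for illustration.
    print(likely_cause(used, 10 * 1024 ** 3, free))
```

If this reported "filesystem full" for a task still within its reservation,
that would fit the theory that the IOException comes from the kernel and not
from disk monitoring.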

As such, we're going to try configuring the Mesos slaves with
--resources='disk:some_smaller_value' to leave a little headroom, in the
hope that the Mesos disk monitor catches the overuse before the process
attempts to claim the last free block on disk.
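
For what it's worth, this is roughly how we'd pick some_smaller_value. It is
only a sketch: the helper names and the 5% headroom figure are assumptions of
mine, as is measuring the filesystem at "/" (the real work directory may live
elsewhere). As far as I know, Mesos reads the disk resource in megabytes.

```python
import shutil


def advertised_disk_mb(total_bytes, headroom_percent):
    """Return the disk size in MB to advertise, holding back headroom_percent."""
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")
    total_mb = total_bytes // (1024 * 1024)
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_mb * (100 - headroom_percent) // 100


def resources_flag(disk_mb):
    """Format the slave flag in the same shape as quoted above."""
    return "--resources='disk:{}'".format(disk_mb)


if __name__ == "__main__":
    # "/" is a stand-in for whichever filesystem holds the sandboxes.
    total = shutil.disk_usage("/").total
    # 5% headroom is a guess, not a recommendation.
    print(resources_flag(advertised_disk_mb(total, 5)))
```

Since this only sets disk, any other resources would still need to go into
the same flag value.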

I don't know why it'd be nuking the sandbox, though. And is the GC executor
more aggressive about cleaning out old sandbox directories if the disk is
low on free space?

If it helps, we're on Aurora commit
2bf03dc5eae89b1e40bfd47683c54c185c78a9d3.

Thanks,

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard
