mesos-user mailing list archives

From Vinod Kone <vinodk...@gmail.com>
Subject Re: Mesos slave GC clarification
Date Thu, 26 Dec 2013 19:26:12 GMT
Hi Thomas,

The GC in mesos slave works as follows:

--> Whenever an executor terminates, the slave schedules its sandbox directory
for gc "--gc_delay" into the future.

--> However, the slave also periodically (every "--disk_watch_interval")
monitors the disk utilization and expedites the gc based on that usage.

For example, if gc_delay is 1 week and the current disk utilization is 80%,
then instead of waiting for a week to gc a terminated executor's sandbox,
the slave gc'es it after 16.8 hours (= (1 - GC_DISK_HEADROOM - 0.8) *
7 days). GC_DISK_HEADROOM is currently set to 0.1.
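
In code, that adjustment looks roughly like the sketch below (the names
computeGcDelaySecs and DISK_HEADROOM are illustrative, not the actual slave
symbols):

    // Sketch only: illustrates the scaling described above, not the actual
    // Mesos slave implementation.
    #include <algorithm>
    #include <iostream>

    // Fraction of the disk the slave tries to keep free before expediting gc.
    const double DISK_HEADROOM = 0.1;

    // Scale the configured --gc_delay by the remaining headroom: at 0% usage
    // the full delay applies; at (1 - DISK_HEADROOM) usage or above, the
    // sandbox becomes eligible for gc immediately.
    double computeGcDelaySecs(double gcDelaySecs, double diskUsage)
    {
      return gcDelaySecs * std::max(0.0, 1.0 - DISK_HEADROOM - diskUsage);
    }

    int main()
    {
      const double week = 7 * 24 * 60 * 60;          // --gc_delay of 1 week
      double delay = computeGcDelaySecs(week, 0.8);  // 80% disk utilization
      std::cout << delay / 3600 << " hours\n";       // prints 16.8
      return 0;
    }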

However, it might happen that executors are being launched (and sandboxes
created) at a very high rate. In that case the slave might not be able to
react quickly enough to gc sandboxes.

You could grep for "Current usage" in the slave log to see how the disk
utilization varies over time.

HTH,


On Thu, Dec 26, 2013 at 10:56 AM, Thomas Petr <tpetr@hubspot.com> wrote:

> Hi,
>
> We're running Mesos 0.14.0-rc4 on CentOS from the mesosphere repository.
> Last week we had an issue where the mesos-slave process died due to running
> out of disk space. [1]
>
> The mesos-slave usage docs mention the "[GC] delay may be shorter
> depending on the available disk usage." Does anyone have any insight into
> how the GC logic works? Is there a configurable threshold percentage or
> amount that will force it to clean up more often?
>
> If the mesos-slave process is going to die due to lack of disk space,
> would it make sense for it to attempt one last GC run before giving up?
>
> Thanks,
> Tom
>
>
> [1]
> Could not create logging file: No space left on device
> COULD NOT CREATE A LOGGINGFILE 20131221-120618.20562!F1221 12:06:18.978813
> 20567 paths.hpp:333] CHECK_SOME(mkdir): Failed to create executor directory
> '/usr/share/hubspot/mesos/slaves/201311111611-3792629514-5050-11268-18/frameworks/Singularity11/executors/singularity-ContactsHadoopDynamicListSegJobs-contacts-wal-dynamic-list-seg-refresher-1387627577839-1-littleslash-us_east_1e/runs/457a8df0-baa7-4d22-a5ac-ba5935ea6032'No
> space left on device
> *** Check failure stack trace: ***
> I1221 12:06:19.008946 20564 cgroups_isolator.cpp:1275] Successfully
> destroyed cgroup
> mesos/framework_Singularity11_executor_singularity-ContactsTasks-parallel-machines:6988:list-intersection-count:1387565552709-1387627447707-1-littleslash-us_east_1e_tag_fc028903-d303-468d-902a-dade8c22e206
>     @     0x7f2c806bcb5d  google::LogMessage::Fail()
>     @     0x7f2c806c0b77  google::LogMessage::SendToLog()
>     @     0x7f2c806be9f9  google::LogMessage::Flush()
>     @     0x7f2c806becfd  google::LogMessageFatal::~LogMessageFatal()
>     @           0x40f6cf  _CheckSome::~_CheckSome()
>     @     0x7f2c804492e3
>  mesos::internal::slave::paths::createExecutorDirectory()
>     @     0x7f2c80418a6d
>  mesos::internal::slave::Framework::launchExecutor()
>     @     0x7f2c80419dd3  mesos::internal::slave::Slave::_runTask()
>     @     0x7f2c8042d5d1  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f2c805d3ae8  process::ProcessManager::resume()
>     @     0x7f2c805d3e8c  process::schedule()
>     @     0x7f2c7fe41851  start_thread
>     @     0x7f2c7e78794d  clone
>
