mesos-user mailing list archives

From Vinod Kone <vinodk...@gmail.com>
Subject Re: protecting mesos from fat fingers
Date Fri, 02 May 2014 17:47:27 GMT
The GC algorithm takes disk utilization into account. In other words, if disk
utilization is high, sandboxes will be deleted sooner than the default one
week. Of course, if the disk is filling up faster than GC can react to it,
then there might be a problem.
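
For a rough picture, here is a minimal sketch (not the actual Mesos source) of
how a disk-utilization-aware delay could behave; the one-week maximum and the
10% headroom are illustrative assumptions, not values taken from this thread:

    from datetime import timedelta

    def effective_gc_delay(disk_usage, max_delay=timedelta(weeks=1),
                           headroom=0.1):
        """Shrink the sandbox GC delay linearly as disk usage grows."""
        # At low usage sandboxes keep the full delay; as usage approaches
        # (1 - headroom) the delay drops toward zero, so nearly full disks
        # are cleaned up much sooner than a week.
        scale = max(0.0, 1.0 - headroom - disk_usage)
        return max_delay * scale

    print(effective_gc_delay(0.50))  # roughly 2.8 days
    print(effective_gc_delay(0.95))  # 0:00:00 -- eligible for GC right away

The caveat above still applies: if tasks fill the disk faster than GC polls,
no amount of scaling down the delay will keep up.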


On Fri, May 2, 2014 at 10:35 AM, Dick Davies <dick@hellooperator.net> wrote:

> Not quite - it looks to me like the mesos slave disks filled up with failed
> jobs (because marathon continued to throw a broken .zip at them) and, with
> /tmp on the root fs, the servers became unresponsive. Tobi mentions there's
> a way to set that at deploy time, but in this case the guy who can't type
> 'hello world' correctly would have been responsible for setting the rate
> limits too (that's me, by the way!), so in itself that's not protection
> from pilot error.
>
> I'm not sure whether GC would have cleared /var any better (I doubt it very
> much; my impression was that it runs on the order of days). I think it's
> more that the deploy could have been cancelled while the system was still
> functioning (speculation - I'm still in the early stages of learning the
> internals of this).
>
> On 30 April 2014 22:08, Vinod Kone <vinodkone@gmail.com> wrote:
> > Dick, I also briefly skimmed your original email to the marathon mailing
> > list and it sounded like executor sandboxes were not getting garbage
> > collected (a mesos feature) when the slave work directory was rooted in
> > /tmp vs /var? Did I understand that right? If yes, I would love to see
> > some logs.
> >
> >
> > On Wed, Apr 30, 2014 at 1:51 PM, Tobias Knaup <tobi@knaup.me> wrote:
> >>
> >> In Marathon you can specify taskRateLimit (max number of tasks to start
> >> per second) as part of your app definition.
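> >>
> >> For illustration, a minimal sketch of an app definition carrying that
> >> field (everything here other than taskRateLimit is a made-up example,
> >> not something from this thread):
> >>
> >>     import json
> >>
> >>     app = {
> >>         "id": "hello-world",                 # hypothetical app id
> >>         "cmd": "python -m SimpleHTTPServer 8080",
> >>         "instances": 2,
> >>         "taskRateLimit": 1.0,                # start at most ~1 task/sec
> >>     }
> >>
> >>     # Body you would submit to Marathon when creating/updating the app.
> >>     print(json.dumps(app, indent=2))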
> >>
> >>
> >> On Wed, Apr 30, 2014 at 11:30 AM, Dick Davies <dick@hellooperator.net>
> >> wrote:
> >>>
> >>> Managed to take out a mesos slave today with a typo while launching
> >>> a marathon app, and wondered if there are throttles/limits that can be
> >>> applied to repeated launches to limit the risk of such mistakes in the
> >>> future.
> >>>
> >>> I started a thread on the marathon list
> >>> (https://groups.google.com/forum/?hl=en#!topic/marathon-framework/4iWLqTYTvgM)
> >>>
> >>> [ TL;DR: marathon throws an app that will never deploy correctly at
> >>> slaves
> >>> until the disk fills with debris and the slave dies ]
> >>>
> >>> but I suppose this could be something available in mesos itself.
> >>>
> >>> I can't find a lot of advice on the operational aspects of Mesos
> >>> administration; could others here share their experience with
> >>> preventing failed task deploys from causing trouble on their clusters?
> >>>
> >>> Thanks!
> >>
> >>
> >
>
