mesos-user mailing list archives

From Sharma Podila <spod...@netflix.com>
Subject Re: protecting mesos from fat fingers
Date Wed, 07 May 2014 04:03:24 GMT
>
> 1. a rogue job can potentially render slaves useless, and,
> Concretely what kinds of things are you considering here? Are you
> considering jobs that saturate non-isolated resources? Something else?


More precisely, any footprint of the job that persists after job completion
has the potential to do so. For example, files left behind in /tmp, changes to
system files if the job runs with sufficient roles/permissions, etc. An
isolation container that eliminates all such side effects of the job would be
ideal. Cgroups-based isolation, as an example, wouldn't guarantee it.
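
As a rough illustration of the kind of residue cgroups-based isolation won't
catch, here is a minimal sketch (plain Python, not Mesos API code; the path
and the alerting hook are assumptions) that snapshots /tmp before a task
launch and reports anything left behind after it completes:

    import os

    def snapshot(path="/tmp"):
        """Record the set of entries currently present under `path`."""
        return set(os.listdir(path))

    def leftover_footprint(before, path="/tmp"):
        """Return entries that appeared under `path` and were never cleaned up."""
        return sorted(snapshot(path) - before)

    # Hypothetical usage around a single task launch:
    #   before = snapshot()
    #   ... launch the task, wait for a terminal status update ...
    #   debris = leftover_footprint(before)
    #   if debris:
    #       alert_operator(debris)   # hypothetical alerting hook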

> 2. a rogue slave (or rather a rogue executor) can blackhole jobs via false
> positive completions
> Concretely what kinds of things are you considering here? A maliciously
> constructed slave? How would these false positives be fabricated? Does
> authentication preclude this?


If a slave or an executor reports a successful job completion despite an
internal failure, that would be a false positive. I can only come up with a
couple of examples at this time; I'm not sure if there's a bigger
generalization here and will get back to you on that. For example, a problem
with user credentials or file/network access causes the job to fail to load,
but the executor reports normal completion. Or, an executor watching the
launched job processes incorrectly concludes normal completion, whereas the
actual job processes died almost immediately after launching.

Also, a slave/executor reporting too many failures (as opposed to false
positives) can be another case to consider and to remove from the pool of
usable slaves/executors.
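
To make the executor example concrete, here is a minimal sketch (plain
Python, not the Mesos executor API; the `report` callback stands in for
whatever status-update mechanism is used) of a launch path that only reports
success after actually waiting on the task's processes; skipping the wait and
exit-code check is exactly what produces the false positive above:

    import subprocess

    def run_and_report(cmd, report):
        """Launch the task command, wait for it, and report a status that
        reflects the real outcome instead of assuming success."""
        try:
            proc = subprocess.Popen(cmd)
        except OSError as err:
            # e.g. bad credentials, missing binary, unreadable files
            report("TASK_FAILED", str(err))
            return
        exit_code = proc.wait()
        if exit_code == 0:
            report("TASK_FINISHED", "exited cleanly")
        else:
            report("TASK_FAILED", "exit code %d" % exit_code)

    # An executor that reports TASK_FINISHED right after the launch, without
    # the wait() and exit-code check above, yields the false positive even
    # though the job processes died almost immediately.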

Sharma



On Tue, May 6, 2014 at 12:05 PM, Benjamin Mahler
<benjamin.mahler@gmail.com> wrote:

> Interesting points, I'd like to understand your two cases above:
>
> 1. a rogue job can potentially render slaves useless, and,
>
> Concretely what kinds of things are you considering here? Are you
> considering jobs that saturate non-isolated resources? Something else?
>
> 2. a rogue slave (or rather a rogue executor) can blackhole jobs via false
> positive completions
>
> Concretely what kinds of things are you considering here? A maliciously
> constructed slave? How would these false positives be fabricated? Does
> authentication preclude this?
>
>
> On Fri, May 2, 2014 at 11:00 AM, Sharma Podila <spodila@netflix.com> wrote:
>
>> Although I am not as familiar with Marathon specifics, in general,
>>
>> 1. a rogue job can potentially render slaves useless, and,
>> 2. a rogue slave (or rather a rogue executor) can blackhole jobs via
>> false positive completions
>>
>> A strategy that helps with #1 is to limit the number of re-launches of an
>> individual job/task upon failure. Even better if this is based on the
>> failure rate; simple rate limiting may only delay the problem for a while.
>> A strategy that helps with #2 is to "disable" the slave from further
>> launches when too many failures are reported from it in a given time
>> period. This can leave many slaves disabled and reduce cluster throughput
>> (which should alert the operator), but that is better than falsely marking
>> all jobs as completed.
>>
>> An out-of-band monitor that watches job/task lifecycle events can achieve
>> both, for example by applying a stream-processing technique over the
>> continuous event stream.
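
As a rough sketch of both strategies (the window, thresholds, and the two
action hooks are assumptions for illustration, not Mesos or Marathon APIs),
such an out-of-band monitor could track failures per task and per slave over
a sliding window:

    import time
    from collections import defaultdict, deque

    FAILURE_WINDOW_SECS = 600    # assumed sliding window
    MAX_TASK_FAILURES = 5        # assumed relaunch budget per task in the window
    MAX_SLAVE_FAILURES = 20      # assumed failure budget per slave in the window

    task_failures = defaultdict(deque)   # task id  -> recent failure timestamps
    slave_failures = defaultdict(deque)  # slave id -> recent failure timestamps

    def _record(history, key, now):
        """Append a failure and drop entries older than the sliding window."""
        q = history[key]
        q.append(now)
        while q and now - q[0] > FAILURE_WINDOW_SECS:
            q.popleft()
        return len(q)

    def on_task_failed(task_id, slave_id, now=None):
        """Handle one failed-task lifecycle event from the event stream."""
        now = now if now is not None else time.time()
        if _record(task_failures, task_id, now) > MAX_TASK_FAILURES:
            stop_relaunching(task_id)   # hypothetical hook into the scheduler
        if _record(slave_failures, slave_id, now) > MAX_SLAVE_FAILURES:
            disable_slave(slave_id)     # hypothetical hook to stop further launches

    def stop_relaunching(task_id):
        print("too many failures in window; not relaunching %s" % task_id)

    def disable_slave(slave_id):
        print("too many failures reported; disabling launches on %s" % slave_id)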
>>
>> Sharma
>>
>>
>>
>> On Fri, May 2, 2014 at 10:35 AM, Dick Davies <dick@hellooperator.net> wrote:
>>
>>> Not quite - it looks to me like the mesos slave disks filled up with
>>> failed jobs (because marathon continued to throw a broken .zip at them)
>>> and with /tmp on the root fs the servers became unresponsive. Tobi
>>> mentions there's a way to set that at deploy time, but in this case the
>>> guy who can't type 'hello world' correctly would have been responsible
>>> for setting the rate limits too (that's me, by the way!), so in itself
>>> that's not protection from pilot error.
>>>
>>> I'm not sure if GC was able to clear /var any better (I doubt it very
>>> much; my impression was that it's on the order of days). I think it's
>>> more that the deploy could have been cancelled while the system was
>>> still functioning (speculation - I'm still in the early stages of
>>> learning the internals of this).
>>>
>>> On 30 April 2014 22:08, Vinod Kone <vinodkone@gmail.com> wrote:
>>> > Dick, I've also briefly skimmed your original email to the marathon
>>> > mailing list and it sounded like executor sandboxes were not getting
>>> > garbage collected (a mesos feature) when the slave work directory was
>>> > rooted in /tmp vs /var? Did I understand that right? If yes, I would
>>> > love to see some logs.
>>> >
>>> >
>>> > On Wed, Apr 30, 2014 at 1:51 PM, Tobias Knaup <tobi@knaup.me> wrote:
>>> >>
>>> >> In Marathon you can specify taskRateLimit (max number of tasks to
>>> >> start per second) as part of your app definition.
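
For what it's worth, a minimal sketch of using that field (the app values,
Marathon URL, and the use of the requests library are assumptions; only
taskRateLimit comes from Tobi's description) might look like:

    import json
    import requests  # assumes the requests library is available

    app = {
        "id": "hello-world",                   # hypothetical app definition
        "cmd": "python -m SimpleHTTPServer 8080",
        "instances": 2,
        "cpus": 0.25,
        "mem": 128,
        "taskRateLimit": 1.0,                  # at most one task start per second
    }

    resp = requests.post("http://marathon.example.com:8080/v2/apps",
                         data=json.dumps(app),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()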
>>> >>
>>> >>
>>> >> On Wed, Apr 30, 2014 at 11:30 AM, Dick Davies <dick@hellooperator.net>
>>> >> wrote:
>>> >>>
>>> >>> Managed to take out a mesos slave today with a typo while launching
>>> >>> a marathon app, and wondered if there are throttles/limits that can
>>> >>> be applied to repeated launches to limit the risk of such mistakes
>>> >>> in the future.
>>> >>>
>>> >>> I started a thread on the marathon list
>>> >>> (https://groups.google.com/forum/?hl=en#!topic/marathon-framework/4iWLqTYTvgM)
>>> >>>
>>> >>> [ TL;DR: marathon throws an app that will never deploy correctly at
>>> >>> slaves until the disk fills with debris and the slave dies ]
>>> >>>
>>> >>> but I suppose this could be something available in mesos itself.
>>> >>>
>>> >>> I can't find a lot of advice about operational aspects of Mesos admin;
>>> >>> could others here provide some good advice about their experience in
>>> >>> preventing failed task deploys from causing trouble on their clusters?
>>> >>>
>>> >>> Thanks!
>>> >>
>>> >>
>>> >
>>>
>>
>>
>
