mesos-issues mailing list archives

From "Anindya Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-7087) Consider improving exponential backoff algorithm.
Date Thu, 09 Feb 2017 18:13:41 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859933#comment-15859933 ]

Anindya Sinha commented on MESOS-7087:
--------------------------------------

Here is a write-up of a proposal to address this situation:

https://docs.google.com/document/d/1nUxvh6BbB8jv5G-MvckGj9XzFYLBrUM0O5Go_Zmdftk/edit?usp=sharing

Comments/feedback welcome.

> Consider improving exponential backoff algorithm.
> -------------------------------------------------
>
>                 Key: MESOS-7087
>                 URL: https://issues.apache.org/jira/browse/MESOS-7087
>             Project: Mesos
>          Issue Type: Improvement
>          Components: general
>            Reporter: Anindya Sinha
>            Assignee: Anindya Sinha
>
> There are 3 types of backoff algorithms in use:
> 1) Exponential backoff with randomness, as in framework/agent registration.
> 2) Exponential backoff with no randomness, as in status updates.
> 3) Linear backoff with randomness, as in executor registration.
> Consider framework registration. The nth retry attempt is made after a random interval in the
> range [0 .. backoff * 2^(n-1)], as long as each interval is less than 1 min. The default value
> for backoff is 2 secs.
> Although the current approach combines exponential backoff with randomness, we have observed
> that in clusters with thousands of agents and/or frameworks, the actual (randomized) retry
> interval can end up being very short for a substantial number of agents and/or frameworks,
> because the allowed range is [0 .. <n>] and so starts at 0. This bombards the master with
> messages and overloads it.
> So, the main issues to address (especially for large numbers of frameworks and/or agents) are:
> 1) Every subsequent retry should be spaced from the previous attempt by a minimum deterministic
> amount.
> 2) Every subsequent retry interval should be greater than or equal to the previous one.
> 3) The maximum retry interval should be configurable, since it can be a function of the initial
> backoff factor.
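
For illustration only, here is a minimal Python sketch of the behavior described above and of one possible way to satisfy the three points. It is not Mesos code and not the design in the linked document. The function names, the `min_fraction` parameter, and the reading of "less than 1 min" as a 60 second cap on the upper bound are all assumptions made for this example.

```python
import random


def current_interval(n, backoff=2.0, cap=60.0, rng=random):
    """Delay before the nth retry as described in the issue.

    Drawn uniformly from [0, backoff * 2^(n-1)]. Treating "less than
    1 min" as a cap on the upper bound is an assumption of this sketch.
    """
    upper = min(backoff * 2 ** (n - 1), cap)
    return rng.uniform(0.0, upper)


def bounded_interval(n, previous=0.0, backoff=2.0, cap=60.0,
                     min_fraction=0.5, rng=random):
    """Hypothetical alternative that addresses the three points.

    1) The delay never falls below min_fraction * upper, a deterministic
       floor (min_fraction is an invented parameter).
    2) The delay is never smaller than the previous delay.
    3) The maximum delay (cap) is a parameter instead of a constant.
    """
    upper = min(backoff * 2 ** (n - 1), cap)
    # The floor is the larger of the deterministic minimum and the
    # previous delay, clamped so that it never exceeds the upper bound.
    lower = max(min_fraction * upper, min(previous, upper))
    return rng.uniform(lower, upper)
```

With the defaults, the first sketch can return a delay arbitrarily close to 0 on any attempt, which is the behavior the issue describes. The second keeps the randomness but confines it to the upper part of the range; a caller would pass the delay it last used as `previous`.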



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
