mesos-issues mailing list archives

From "Yan Xu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MESOS-7710) Mesos agent registration retry backoff window always has a zero lower-bound
Date Thu, 22 Jun 2017 15:59:00 GMT
Yan Xu created MESOS-7710:
-----------------------------

             Summary: Mesos agent registration retry backoff window always has a zero lower-bound
                 Key: MESOS-7710
                 URL: https://issues.apache.org/jira/browse/MESOS-7710
             Project: Mesos
          Issue Type: Bug
          Components: agent, master
            Reporter: Yan Xu


In a large cluster, when the master fails over, agents retry reregistration with a backoff
algorithm that expands a randomization window whose lower bound stays at zero. In such a
situation the master is heavily backlogged, so even if only a portion of the agents retry
too fast, it aggravates the situation for everyone.
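
To illustrate the problem, here is a minimal sketch (not the actual agent code) of a
backoff whose window grows but whose lower bound stays at zero. The doubling policy, the
function names and all numbers are assumptions made for illustration only.

```python
import random

def zero_floor_delays(initial_window, max_window, attempts, rng):
    """Return one delay per attempt, each drawn uniformly from [0, window].

    The window doubles after every attempt up to max_window (an assumed
    policy); its lower bound never moves off zero.
    """
    window = initial_window
    delays = []
    for _ in range(attempts):
        delays.append(rng.uniform(0.0, window))
        window = min(window * 2, max_window)
    return delays

def fraction_within(threshold, window):
    """Expected share of agents whose delay lands in [0, threshold]."""
    return min(threshold / window, 1.0)

# Hypothetical numbers: 1s initial window, 60s cap, 6 attempts.
delays = zero_floor_delays(1.0, 60.0, 6, random.Random(7710))
share = fraction_within(1.0, 32.0)  # 1/32 of agents still retry within 1s
```

Even with a 32-second window, a uniform draw from [0, 32] leaves 1/32 of the agents
retrying within one second, which in a large cluster can still be a lot of messages.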

The proposal is to increase the lower bound during the backoff. However, we should probably
not create a customized backoff algorithm for this particular case but instead have it depend
on the generic solution in MESOS-7646.
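
One possible shape for a window with a rising lower bound is sketched below. This is only
an illustration under assumptions: the choice of half the upper bound as the floor, the
function name and the parameters are hypothetical and are not what MESOS-7646 specifies.

```python
import random

def raised_floor_delay(attempt, base, cap, rng):
    """Draw a retry delay from [upper / 2, upper].

    upper doubles with each attempt (0-based) and is capped at `cap`.
    The floor of upper / 2 is an assumed choice, so the lower bound
    rises together with the upper bound instead of staying at zero.
    """
    upper = min(base * 2 ** attempt, cap)
    lower = upper / 2
    return rng.uniform(lower, upper)

# Hypothetical numbers: 1s base window, 60s cap.
rng = random.Random(7710)
delays = [raised_floor_delay(attempt, 1.0, 60.0, rng) for attempt in range(8)]
```

With this shape no agent retries sooner than half of the current window, so the earliest
possible retry moves out as the backoff progresses.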

This shouldn't increase the burden on the operator by requiring them to tune these parameters
according to cluster size; it should rather rely on sensible defaults.

To combat dropped messages, this perhaps works better together with MESOS-7688: if the agents
only start reregistration once the master has recovered, then it's more reasonable to back off
more aggressively.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
