mesos-issues mailing list archives

From "James Peach (JIRA)" <>
Subject [jira] [Commented] (MESOS-9178) Add a metric for master failover time.
Date Wed, 22 Aug 2018 17:36:00 GMT


James Peach commented on MESOS-9178:

/cc [~bmahler]

> Add a metric for master failover time.
> --------------------------------------
>                 Key: MESOS-9178
>                 URL:
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Xudong Ni
>            Assignee: Xudong Ni
>            Priority: Minor
> Quote from Yan Xu: Previously, the argument against it was that you don't know whether all agents
> are going to come back after a master failover, so there's no certain point that marks the
> end of "full reregistration of all agents". However, empirically the number of agents usually
> doesn't change during the failover, and there's an upper bound on such a wait (after a 10min timeout,
> the agents that haven't reregistered are going to be marked unreachable), so we can just use
> that to stop the timer.
> So we can define failover time as "the time it takes for all agents recovered from the
> registry to be accounted for", i.e., either reregistered or marked as unreachable.
> This is of course looking at failover from an agent reregistration perspective.
> Later, after we add framework info persistence, we can similarly define the framework
> perspective using reregistration time or reconciliation time.
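
To make the proposed definition concrete, here is a rough sketch of the bookkeeping it implies. It is written in Python for brevity and is not Mesos code: the class FailoverTimer, its method names, and the injected clock are all hypothetical. The timer starts when the agents are recovered from the registry and stops once every one of them has either reregistered or been marked unreachable.

```python
import time
from typing import Callable, Iterable, Optional, Set


class FailoverTimer:
    """Hypothetical helper (not a Mesos API): measures the time until every
    agent recovered from the registry is accounted for."""

    def __init__(self, recovered_agents: Iterable[str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()                      # failover begins here
        self._pending: Set[str] = set(recovered_agents)
        self.elapsed_secs: Optional[float] = None  # set once, when the timer stops
        self._maybe_stop()                         # handles an empty registry

    def reregistered(self, agent_id: str) -> None:
        """The agent came back after the failover."""
        self._account_for(agent_id)

    def marked_unreachable(self, agent_id: str) -> None:
        """The agent missed the reregistration timeout."""
        self._account_for(agent_id)

    def _account_for(self, agent_id: str) -> None:
        self._pending.discard(agent_id)
        self._maybe_stop()

    def _maybe_stop(self) -> None:
        # Stop only once, at the moment the last pending agent is accounted for.
        if self.elapsed_secs is None and not self._pending:
            self.elapsed_secs = self._clock() - self._start
```

Because unreachable agents count as accounted for, the measured value is bounded by the reregistration timeout mentioned in the quote, even if some agents never return.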

This message was sent by Atlassian JIRA
