mesos-issues mailing list archives

From "Yan Xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-7711) Master updates registry for reregistering agents even when they haven't been unreachable
Date Wed, 22 Nov 2017 19:43:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263224#comment-16263224 ]

Yan Xu commented on MESOS-7711:
-------------------------------

Clarification on the fix: by not calling the registrar in the mentioned scenario, we eliminate
the delay incurred when the registrar dispatches back into the master actor after the
operation is done. The master actor can be backed up significantly during a master failover,
so skipping this dispatch reduces the overall time an agent's reregistration request spends
on the master. We have seen a ~50% reduction in the total time for all agents to reregister
after a master failover.

> Master updates registry for reregistering agents even when they haven't been unreachable
> ----------------------------------------------------------------------------------------
>
>                 Key: MESOS-7711
>                 URL: https://issues.apache.org/jira/browse/MESOS-7711
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Yan Xu
>            Assignee: Yan Xu
>             Fix For: 1.4.0
>
>
> During a master failover we observed many registry updates, on average _one per two agents_,
> as indicated by the log line:
> {noformat:title=}
> I0609 04:46:25.220196 48864 registrar.cpp:550] Successfully updated the registry in 42.904064ms
> {noformat}
> [code|https://github.com/apache/mesos/blob/19a6134d03141dc2cb073a904378c2c129b5138d/src/master/registrar.cpp#L550]
> In this case few agents were ever unreachable, so most of these registry updates are
> redundant. Each registry update also incurs the time spent applying the operations:
> {noformat:title=}
> I0609 04:46:26.475761 48897 registrar.cpp:493] Applied 1 operations in 11.673082ms; attempting to update the registry
> {noformat}
> [code|https://github.com/apache/mesos/blob/19a6134d03141dc2cb073a904378c2c129b5138d/src/master/registrar.cpp#L493]
> Even though these operations do not consume the Master actor's time, every agent
> reregistration is gated on and delayed by them. This could easily be avoided by checking
> the {{slaves.recovered}} field in {{Master}}.
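
As a rough illustration of the check suggested above, here is a minimal, self-contained
sketch. It is not the actual Mesos code: `MasterSketch`, `reregister`, `unreachable` and
`registryUpdates` are hypothetical names invented for this example, and only `recovered` is
meant to mirror the {{slaves.recovered}} field mentioned in the issue. The assumption is
that an agent still listed as recovered from the registry after a failover is already
admitted there, so its reregistration needs no registry write.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical, simplified stand-in for the master's agent bookkeeping.
struct MasterSketch {
  // Agents read back from the registry after a failover that have not yet
  // reregistered (assumed to correspond to `slaves.recovered`).
  std::unordered_set<std::string> recovered;

  // Agents that were marked unreachable (illustrative only).
  std::unordered_set<std::string> unreachable;

  // Counts how many registry writes this sketch would have issued.
  int registryUpdates = 0;

  // Handles a reregistration request. Returns true if a registry update
  // was issued, false if it was skipped.
  bool reregister(const std::string& agentId) {
    if (recovered.erase(agentId) > 0) {
      // The agent was never unreachable: it is already in the registry,
      // so no registrar round trip is needed.
      return false;
    }

    // Otherwise the agent was unreachable (or is unknown to this sketch),
    // so the registry has to be updated before it is readmitted.
    unreachable.erase(agentId);
    ++registryUpdates;
    return true;
  }
};
```

With this shape, a failover in which nearly all agents are still in `recovered` issues
almost no registry updates, which is the saving described in the comment above.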



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
