mesos-issues mailing list archives

From "Dmitry Zhuk (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-7713) Optimize number of copies made in dispatch/defer mechanism
Date Thu, 29 Jun 2017 11:18:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068196#comment-16068196 ]

Dmitry Zhuk commented on MESOS-7713:
------------------------------------

The spreadsheet at https://docs.google.com/spreadsheets/d/1xqFxcWxOyjbozro0SkshTIKkaGgShRMN8bqBsdtnl8k/edit?usp=sharing
demonstrates the performance improvement for master failover with the patches applied. Reregistration
time was reduced from 1:20 to 1:00 (not including the time to recover the registry).

Test environment: scale test cluster simulating ~40K agents and ~100K tasks, dedicated master
hosts, {{--reregistration_backoff_factor=45secs}} on agents.

Versions tested:
1.2.0 - Mesos 1.2.0 +  https://reviews.apache.org/r/58355/
1.2.0-fix - same as above + https://reviews.apache.org/r/60002/, https://reviews.apache.org/r/60003/
+ https://reviews.apache.org/r/60472/, https://reviews.apache.org/r/60473/,  https://reviews.apache.org/r/60474/
+ changes to install the {{Master::reregisterSlave}} handler with the {{mutable_}} versions of the
protobuf message field accessors, take parameters by value, and {{std::move}} them to {{defer}}.
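
The last change above can be illustrated with a minimal C++14 sketch. It is not the actual Mesos patch: {{Task}}, {{ReregisterMessage}}, {{handle}} and {{process}} are hypothetical stand-ins, and {{std::function}} only approximates a deferred call, not libprocess's {{defer}}. It counts element copies when the handler receives the repeated field through the const accessor versus moving it out through the {{mutable_}} accessor.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical element type that counts how often it is copied.
struct Task {
  static int copies;
  std::string id;
  explicit Task(std::string i) : id(std::move(i)) {}
  Task(const Task& other) : id(other.id) { ++copies; }
  Task(Task&& other) noexcept : id(std::move(other.id)) {}
};
int Task::copies = 0;

// Hypothetical stand-in for a protobuf message with one repeated field.
class ReregisterMessage {
public:
  const std::vector<Task>& tasks() const { return tasks_; }
  std::vector<Task>* mutable_tasks() { return &tasks_; }

private:
  std::vector<Task> tasks_;
};

// Final consumer, standing in for the method the deferred call reaches.
std::size_t process(std::vector<Task> tasks) { return tasks.size(); }

// The handler takes its parameter by value and moves it into the callback.
// The callback is one-shot: invoking it moves the captured vector out.
std::function<std::size_t()> handle(std::vector<Task> tasks) {
  return [tasks = std::move(tasks)]() mutable {
    return process(std::move(tasks));
  };
}

ReregisterMessage makeMessage() {
  ReregisterMessage message;
  message.mutable_tasks()->emplace_back("task-1");
  message.mutable_tasks()->emplace_back("task-2");
  return message;
}

// Passing the const accessor's result copies the whole vector once.
int copiesViaConstAccessor() {
  ReregisterMessage message = makeMessage();
  Task::copies = 0;
  handle(message.tasks())();
  return Task::copies;
}

// Moving out of the mutable accessor's result copies no elements.
int copiesViaMutableAccessor() {
  ReregisterMessage message = makeMessage();
  Task::copies = 0;
  handle(std::move(*message.mutable_tasks()))();
  return Task::copies;
}
```

In this sketch the const-accessor path copies each element once when the by-value parameter is built, while the {{mutable_}} path only moves. The message's field is left in a moved-from state afterwards, so this only fits when the message is not used again.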

Each version was tested 3 times by killing the leading master and collecting metrics from the newly
elected master's logs.
Metrics are calculated by counting the number of occurrences of different messages in the logs:
{{reregistering}} - "Re-registering agent ..."
{{ignoring}} - "Ignoring re-register agent message from agent ... as readmission is already
in progress"
{{reregistered}} - "Re-registered agent ..."
{{sending}} - "Sending updated checkpointed resources ... to agent ..."
{{update}} - "Received update of agent ... with total oversubscribed resources ..."
{{pending}} = {{reregistering}} - {{sending}} - indicates number of in-progress reregistrations.
{{offers}} - "Sending ... offers to framework ..."
{{applied_cnt}}, {{applied}} - "Applied ... operations in ...; attempting to update the registry"
(corresponds to the number of messages and the total number of operations)
{{reg_updated}} - "Successfully updated the registry in ..." (extracted duration from message).
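
The counting approach described above could be sketched roughly as follows. This is an illustration, not the script used for the measurements: the {{Metrics}} struct and {{countMessages}} function are hypothetical, only four of the counters are shown, and matching is done on the message prefixes quoted above.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical container for a subset of the counters listed above.
struct Metrics {
  long reregistering = 0;
  long ignoring = 0;
  long reregistered = 0;
  long sending = 0;

  // pending = reregistering - sending, as defined in the list above.
  long pending() const { return reregistering - sending; }
};

// Scans a log stream line by line and counts lines containing each message.
Metrics countMessages(std::istream& log) {
  Metrics metrics;
  std::string line;
  while (std::getline(log, line)) {
    if (line.find("Re-registering agent") != std::string::npos) {
      ++metrics.reregistering;
    } else if (line.find("Ignoring re-register agent message") !=
               std::string::npos) {
      ++metrics.ignoring;
    } else if (line.find("Re-registered agent") != std::string::npos) {
      ++metrics.reregistered;
    } else if (line.find("Sending updated checkpointed resources") !=
               std::string::npos) {
      ++metrics.sending;
    }
  }
  return metrics;
}
```

Feeding the function a {{std::ifstream}} opened on a master log gives totals for the whole file. Bucketing by timestamp would be needed to plot the counters over time.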

> Optimize number of copies made in dispatch/defer mechanism
> ----------------------------------------------------------
>
>                 Key: MESOS-7713
>                 URL: https://issues.apache.org/jira/browse/MESOS-7713
>             Project: Mesos
>          Issue Type: Task
>          Components: libprocess
>    Affects Versions: 1.2.0, 1.2.1, 1.3.0
>            Reporter: Dmitry Zhuk
>            Assignee: Dmitry Zhuk
>
> Profiling agent reregistration for a large cluster shows that many CPU cycles are spent
on copying protobuf objects. This is partially due to copies made by code like this:
> {code}
> future.then(defer(self(), &Process::method, param));
> {code}
> {{param}} could be copied 8-10 times before it reaches {{method}}. Specifically, {{reregisterSlave}}
accepts vectors of rather complex objects, which are passed to {{defer}}.
> Currently there are some places in the {{defer}}, {{dispatch}} and {{Future}} code that
could use {{std::move}} and {{std::forward}} to avoid some of the copies.
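
The effect of {{std::forward}} and {{std::move}} inside a deferred-call wrapper, as suggested in the issue description above, can be shown with a small C++14 sketch. The wrappers {{deferCopying}} and {{deferForwarding}} and the {{Tracked}} type are hypothetical and far simpler than libprocess's {{defer}}; they only contrast a wrapper that copies its argument with one that forwards it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical argument type that counts how often it is copied.
struct Tracked {
  static int copies;
  std::string data;
  explicit Tracked(std::string d) : data(std::move(d)) {}
  Tracked(const Tracked& other) : data(other.data) { ++copies; }
  Tracked(Tracked&& other) noexcept : data(std::move(other.data)) {}
};
int Tracked::copies = 0;

// Target of the deferred call; takes its argument by value.
std::size_t length(Tracked t) { return t.data.size(); }

// Wrapper without forwarding: the argument is copied into the closure
// and copied again when the closure calls f.
template <typename F, typename A>
auto deferCopying(F f, const A& arg) {
  return [f, arg]() { return f(arg); };
}

// Wrapper with forwarding: an rvalue argument is moved into the closure
// and moved out again on the (single) invocation.
template <typename F, typename A>
auto deferForwarding(F&& f, A&& arg) {
  return [f = std::forward<F>(f), arg = std::forward<A>(arg)]() mutable {
    return f(std::move(arg));
  };
}

int copiesWithCopyingWrapper() {
  Tracked::copies = 0;
  Tracked t("payload");
  auto call = deferCopying(length, t);
  call();
  return Tracked::copies;
}

int copiesWithForwardingWrapper() {
  Tracked::copies = 0;
  Tracked t("payload");
  auto call = deferForwarding(length, std::move(t));
  call();
  return Tracked::copies;
}
```

With a single wrapper layer the copying variant already makes at least two copies, and every extra layer that passes by value or const reference adds more. The forwarding variant makes none, at the cost of a closure that can only be invoked once.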



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
