mesos-issues mailing list archives

From "Dmitry Zhuk (JIRA)" <>
Subject [jira] [Commented] (MESOS-7713) Optimize number of copies made in dispatch/defer mechanism
Date Thu, 29 Jun 2017 11:18:00 GMT


Dmitry Zhuk commented on MESOS-7713:
- this demonstrates performance improvements for master failover with patches applied. Reregistration
time was reduced from 1:20 to 1:00 (not including the time to recover the registry).

Test environment: scale test cluster simulating ~40K agents and ~100K tasks, dedicated master
hosts, {{--reregistration_backoff_factor=45secs}} on agents.

Versions tested:
1.2.0 - Mesos 1.2.0 +
1.2.0-fix - same as above +,
+ changes to install the {{Master::reregisterSlave}} handler with {{mutable_}} versions of the protobuf
message field accessors, take parameters by value, and {{std::move}} them into {{defer}}.
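
To illustrate why the {{mutable_}} accessors and by-value parameters matter, here is a minimal sketch. It does not use the real protobuf or Mesos API: {{Task}}, {{ReregisterMessage}}, {{handle}} and {{makeMessage}} are hypothetical stand-ins, and the copy counter exists only to make the effect visible.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a protobuf message; counts copy constructions.
struct Task {
  static int copies;
  std::string name;
  explicit Task(std::string n) : name(std::move(n)) {}
  Task(const Task& other) : name(other.name) { ++copies; }
  Task(Task&&) noexcept = default;
};
int Task::copies = 0;

// Hypothetical message with a const accessor and a mutable_ accessor,
// mirroring the naming convention of generated protobuf code.
struct ReregisterMessage {
  std::vector<Task> tasks_;
  const std::vector<Task>& tasks() const { return tasks_; }
  std::vector<Task>* mutable_tasks() { return &tasks_; }
};

// Hypothetical handler: takes its parameter by value so callers may move into it.
std::size_t handle(std::vector<Task> tasks) {
  std::vector<Task> stored = std::move(tasks);  // no element copies here
  return stored.size();
}

ReregisterMessage makeMessage(std::size_t n) {
  ReregisterMessage message;
  for (std::size_t i = 0; i < n; ++i) {
    message.tasks_.emplace_back("task-" + std::to_string(i));
  }
  return message;
}

// Reads through the const accessor: the whole vector is copied into handle().
int copiesViaConstAccessor(ReregisterMessage message) {
  const std::size_t expected = message.tasks().size();
  Task::copies = 0;
  const std::size_t handled = handle(message.tasks());
  assert(handled == expected);
  (void)handled;
  (void)expected;
  return Task::copies;
}

// Steals the field through the mutable accessor: the vector is moved instead.
int copiesViaMutableAccessor(ReregisterMessage message) {
  const std::size_t expected = message.tasks().size();
  Task::copies = 0;
  const std::size_t handled = handle(std::move(*message.mutable_tasks()));
  assert(handled == expected);
  (void)handled;
  (void)expected;
  return Task::copies;
}
```

In this sketch, {{copiesViaConstAccessor(makeMessage(3))}} makes one copy per element, while {{copiesViaMutableAccessor(makeMessage(3))}} makes none. The message is left with an empty or unspecified field afterwards, which is acceptable only when the message is not used again.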

Each version was tested 3 times by killing the leading master and collecting metrics from the newly
elected master's logs.
Metrics are calculated by counting the number of different messages appearing in the logs:
{{reregistering}} - "Re-registering agent ..."
{{ignoring}} - "Ignoring re-register agent message from agent ... as readmission is already
in progress"
{{reregistered}} - "Re-registered agent ..."
{{sending}} - "Sending updated checkpointed resources ... to agent ..."
{{update}} - "Received update of agent ... with total oversubscribed resources ..."
{{pending}} = {{reregistering}} - {{sending}} - indicates the number of in-progress reregistrations.
{{offers}} - "Sending ... offers to framework ..."
{{applied_cnt}}, {{applied}} - "Applied ... operations in ...; attempting to update the registry"
(the number of such messages and the total number of operations, respectively)
{{reg_updated}} - "Successfully updated the registry in ..." (extracted duration from message).
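
The counting described above can be sketched as follows. This is not the script that was actually used for the test: it assumes that matching the fixed leading fragment of each message as a substring is sufficient, and it covers only the simple count-based metrics. {{countMetrics}} and {{countMetricsFromText}} are hypothetical names.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Counts log lines containing each message fragment and derives `pending`.
std::map<std::string, long> countMetrics(std::istream& log) {
  // Metric name -> fixed leading fragment of the log message (assumed
  // to be enough to identify the message).
  const std::map<std::string, std::string> patterns = {
    {"reregistering", "Re-registering agent"},
    {"ignoring", "Ignoring re-register agent message from agent"},
    {"reregistered", "Re-registered agent"},
    {"sending", "Sending updated checkpointed resources"},
    {"update", "Received update of agent"},
  };

  std::map<std::string, long> counts;
  for (const auto& entry : patterns) {
    counts[entry.first] = 0;
  }

  std::string line;
  while (std::getline(log, line)) {
    for (const auto& entry : patterns) {
      if (line.find(entry.second) != std::string::npos) {
        ++counts[entry.first];
      }
    }
  }
  assert(!log.bad());  // the loop should end at end of input, not on an I/O error

  // pending = reregistering - sending, as defined in the list above.
  counts["pending"] = counts["reregistering"] - counts["sending"];
  return counts;
}

// Convenience wrapper for log text that is already in memory.
std::map<std::string, long> countMetricsFromText(const std::string& text) {
  std::istringstream in(text);
  return countMetrics(in);
}
```

To get the progress of reregistration over time rather than a single total, the same function could be run over the log in fixed time slices.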

> Optimize number of copies made in dispatch/defer mechanism
> ----------------------------------------------------------
>                 Key: MESOS-7713
>                 URL:
>             Project: Mesos
>          Issue Type: Task
>          Components: libprocess
>    Affects Versions: 1.2.0, 1.2.1, 1.3.0
>            Reporter: Dmitry Zhuk
>            Assignee: Dmitry Zhuk
> Profiling agent reregistration for a large cluster shows that many CPU cycles are spent
on copying protobuf objects. This is partially due to copies made by code like this:
> {code}
> future.then(defer(self(), &Process::method, param));
> {code}
> {{param}} could be copied 8-10 times before it reaches {{method}}. Specifically, {{reregisterSlave}}
accepts vectors of rather complex objects, which are passed to {{defer}}.
> Currently there are some places in the {{defer}}, {{dispatch}} and {{Future}} code that
could use {{std::move}} and {{std::forward}} to avoid some of the copies.
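
The kind of saving described above can be sketched with a toy binder. This is not the libprocess implementation: {{deferCopy}}, {{deferMove}}, {{Payload}} and {{method}} are hypothetical names, and the sketch assumes C++14 or later. It only shows how forwarding into storage and moving out on invocation removes copies that a by-value chain without {{std::move}} would make.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical payload that counts copy constructions.
struct Payload {
  static int copies;
  std::vector<int> data;
  Payload() = default;
  Payload(const Payload& other) : data(other.data) { ++copies; }
  Payload(Payload&&) noexcept = default;
};
int Payload::copies = 0;

// Stand-in for the target method; takes its argument by value.
inline std::size_t method(Payload p) { return p.data.size(); }

// Naive binder: by-value parameter, copy capture, copy on invocation.
template <typename F, typename A>
auto deferCopy(F f, A arg) {
  return [f, arg]() { return f(arg); };
}

// Forwarding binder: forwards into storage and moves out when invoked.
// The returned callable must be invoked at most once.
template <typename F, typename A>
auto deferMove(F f, A&& arg) {
  return [f, stored = std::decay_t<A>(std::forward<A>(arg))]() mutable {
    return f(std::move(stored));
  };
}

int copiesWithDeferCopy() {
  Payload param;
  param.data.assign(1000, 7);
  Payload::copies = 0;
  auto thunk = deferCopy(&method, param);
  const std::size_t size = thunk();
  assert(size == 1000);
  (void)size;
  return Payload::copies;
}

int copiesWithDeferMove() {
  Payload param;
  param.data.assign(1000, 7);
  Payload::copies = 0;
  auto thunk = deferMove(&method, std::move(param));
  const std::size_t size = thunk();
  assert(size == 1000);
  (void)size;
  return Payload::copies;
}
```

Here {{copiesWithDeferCopy()}} counts three copies (into the parameter, into the capture, and into {{method}}), while {{copiesWithDeferMove()}} counts none. The real {{defer}}/{{dispatch}} path has more layers than this toy, so the absolute numbers differ, but each layer that forwards or moves removes a copy in the same way.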

This message was sent by Atlassian JIRA