Return-Path: X-Original-To: apmail-incubator-mesos-dev-archive@minotaur.apache.org Delivered-To: apmail-incubator-mesos-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E4108926F for ; Sun, 4 Mar 2012 20:40:09 +0000 (UTC) Received: (qmail 3184 invoked by uid 500); 4 Mar 2012 20:40:09 -0000 Delivered-To: apmail-incubator-mesos-dev-archive@incubator.apache.org Received: (qmail 3120 invoked by uid 500); 4 Mar 2012 20:40:09 -0000 Mailing-List: contact mesos-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: mesos-dev@incubator.apache.org Delivered-To: mailing list mesos-dev@incubator.apache.org Received: (qmail 3096 invoked by uid 99); 4 Mar 2012 20:40:09 -0000 Received: from reviews-vm.apache.org (HELO reviews.apache.org) (140.211.11.40) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 04 Mar 2012 20:40:09 +0000 Received: from reviews.apache.org (localhost [127.0.0.1]) by reviews.apache.org (Postfix) with ESMTP id C080F1C0C0A; Sun, 4 Mar 2012 20:40:08 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============7301314202266129352==" MIME-Version: 1.0 Subject: Re: Review Request: (work in progress!) adding functionality to monitoring resource usage for each executor on the slave From: "Sam Whitlock" To: "Benjamin Hindman" , "Charles Reiss" Date: Sun, 04 Mar 2012 20:40:08 -0000 Message-ID: <20120304204008.346.78346@reviews.apache.org> X-ReviewBoard-URL: https://reviews.apache.org X-ReviewRequest-URL: https://reviews.apache.org/r/4167/ Cc: "Alex Degtiar" , "Sam Whitlock" , "mesos" In-Reply-To: <20120304011555.13650.51202@reviews.apache.org> References: <20120304011555.13650.51202@reviews.apache.org> --===============7301314202266129352== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4167/ ----------------------------------------------------------- (Updated 2012-03-04 20:40:08.696183) Review request for mesos, Benjamin Hindman and Charles Reiss. Changes ------- Diff from parent: - reduced logging level for failing to read info from container to INFO (fr= om ERROR), because it is not necessarily an error - added command-line arg -f for frequency of usage info scraping (1/f is th= e time passed to the delay call) Summary ------- This mega-patch is intended to represent the partial completion of the slav= e monitoring functionality. It is not intended to be committed. Changes bas= ed on comments in this review will be reflected in future reviews that are = smaller and more modular. Proc utils is included in this patch, but is already under review here: htt= ps://reviews.apache.org/r/3050/ The relevant design doc can be found here: https://docs.google.com/document= /d/14Wj9i6TpMR6cV3LL0ySjLjfZOq5QOQeybvt1gSyaQGs/edit The following items are ones where specific feedback is requested: * A better mechanism is needed to control the rate at which the slave asks = each executor for its UsageMessage. This is currently hard-coded to be at 1= second intervals, but could potentially be read as a command-line option o= r from a config file. Is there a better or different way to pass in this va= lue? * Currently, UsageMessages are passed from a ResourceMonitor to the Slave u= sing the Future construct, and used as containers that hold a snapshot of t= he latest usage. This is to prevent unnecessary marshalling and extra data = structures, since messages will eventually be sent in the standard dispatch= style from the slave to the master. Is it fine that we are using Protobuf = messages in this way? There are several changes that are not yet implemented in this patch. These= changes are as follows: * Sufficient tests cases have not yet been written for any component (resou= rce monitor, lxc collector, and process collector). * Code has not been cleaned up to adhere to all style recommendations. * Process collector code needs to be updated to prevent CPU usage spikes wh= en monitored sub-processes die. * Code to send UsageMessages from the slave to the master. This addresses bug MESOS-38. https://issues.apache.org/jira/browse/MESOS-38 Diffs (updated) ----- src/monitoring/process_stats.hpp PRE-CREATION = src/monitoring/resource_collector.hpp PRE-CREATION = src/slave/http.cpp f03815d = src/slave/isolation_module.hpp c896908 = src/slave/isolation_module.cpp 5b7b4a2 = src/slave/lxc_isolation_module.hpp b7beefe = src/slave/lxc_isolation_module.cpp d544625 = src/slave/main.cpp ac780c4 = src/slave/process_based_isolation_module.hpp f6f9554 = src/slave/process_based_isolation_module.cpp 100b1e3 = src/slave/resource_monitor.hpp PRE-CREATION = src/slave/resource_monitor.cpp PRE-CREATION = src/slave/slave.hpp b1a07e9 = src/slave/slave.cpp ce8fda5 = src/tests/Makefile.in 6f51be4 = src/tests/proc_utils_tests.cpp PRE-CREATION = src/tests/process_resource_collector_tests.cpp PRE-CREATION = src/tests/resource_monitor_tests.cpp PRE-CREATION = src/monitoring/process_resource_collector.cpp PRE-CREATION = src/monitoring/process_resource_collector.hpp PRE-CREATION = src/monitoring/linux/proc_utils.cpp PRE-CREATION = src/monitoring/linux/proc_utils.hpp PRE-CREATION = src/monitoring/linux/proc_resource_collector.cpp PRE-CREATION = src/monitoring/linux/proc_resource_collector.hpp PRE-CREATION = src/monitoring/linux/lxc_resource_collector.cpp PRE-CREATION = src/master/master.hpp 53551b0 = src/master/master.cpp 1d3961e = src/messages/messages.proto 11a2c41 = src/monitoring/linux/lxc_resource_collector.hpp PRE-CREATION = src/master/allocator.hpp 1ac435b = src/master/http.cpp 591433a = src/Makefile.am 1137a3e = Diff: https://reviews.apache.org/r/4167/diff Testing ------- Test cases: * A test case exercising the basic monitoring code with a mocked-out collec= tor. * The first of several tests for the process resource monitor, with the pro= c-based collecting mocked out. Some ad-hoc testing with log statements to ensure that the monitoring works= end-to-end from both the container-based and process-based isolation modul= es. Thanks, Sam --===============7301314202266129352==--