aurora-reviews mailing list archives

From Franck Cuny via Review Board <nore...@reviews.apache.org>
Subject Re: Review Request 66103: Introduce mesos disk collector
Date Thu, 22 Mar 2018 18:12:49 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199791
-----------------------------------------------------------




src/main/python/apache/thermos/monitoring/disk.py
Lines 132 (patched)
<https://reviews.apache.org/r/66103/#comment280243>

    we should specify a default timeout
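
    A minimal sketch of what a default timeout could look like, using only
    the Python 3 standard library. The names `DEFAULT_AGENT_TIMEOUT_SECS`,
    `fetch_agent_json`, and the `opener` parameter are hypothetical and are
    not taken from the patch; the HTTP client the patch actually uses may
    differ.

```python
import json
from urllib.request import urlopen

# Hypothetical default; the right value depends on how often the agent is polled.
DEFAULT_AGENT_TIMEOUT_SECS = 5.0


def fetch_agent_json(url, timeout=DEFAULT_AGENT_TIMEOUT_SECS, opener=urlopen):
    """Fetch `url` and decode the body as JSON.

    `timeout` bounds each blocking socket operation, so a hung agent cannot
    stall the caller indefinitely. `opener` is injectable for testing.
    """
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

    Callers that need a different bound can still pass `timeout=...`
    explicitly; the point is only that omitting it never means "wait
    forever".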


- Franck Cuny


On March 22, 2018, 5:16 p.m., Reza Motamedi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> -----------------------------------------------------------
> 
> (Updated March 22, 2018, 5:16 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Franck Cuny, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> When disk isolation is enabled in a Mesos agent, the agent calculates the disk usage for each container.
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal to delegate the disk monitoring task to the agent and do it only once. With this approach, whenever disk collection improves in the agent (for instance by implementing XFS isolation), we benefit from it without any code change. More information about the problem is provided in AURORA-1918.
> 
> This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up disk_used_bytes. Note that there is also resource monitoring in the Thermos executor. For now, I left the disk collector there using the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files, including `aurora-executor.service` and `etc_mesos-slave/isolation`, for testing. They can be left as is. I included them in this patch to show how this would work end to end.
> 
> 
> Diffs
> -----
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   RELEASE-NOTES.md 51ab6c724694244bf616b29e9beace4a4a3f5252 
>   docs/reference/observer-configuration.md 8a443c94f7f37f9454989781f722101a97c99f15 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f
>   examples/vagrant/systemd/thermos.service 01925bcd2ae44f100df511f3c3951c3f5a1a72aa
>   src/main/python/apache/aurora/tools/thermos_observer.py dd9f0c46ceac9e939b1b763073314161de0ea614
>   src/main/python/apache/thermos/monitoring/BUILD 65ba7088f65e7baa5d30744736ba456b46a55e86
>   src/main/python/apache/thermos/monitoring/disk.py 986d33a5000f8d5db15cb639c81f8b1d756ffa05
>   src/main/python/apache/thermos/monitoring/resource.py adcdc751c03460dc801a18278faa96d6bd64722b
>   src/main/python/apache/thermos/observer/task_observer.py a6870d48bddf2a2ccede7bb68195f2baae1d0e47
>   src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py fe74bd1d36666ecd89fca1b5b2251202cbbc0f24
>   src/test/python/apache/thermos/monitoring/BUILD 8f2b39336dce6c7b580e6ba0009f60afdcb89179
>   src/test/python/apache/thermos/monitoring/test_disk.py 362393bfd1facf3198e2d438d0596b16700b72b8
>   src/test/python/apache/thermos/monitoring/test_resource.py e577e552d4ee1807096a15401851bb9fd95fa426
> 
> 
> Diff: https://reviews.apache.org/r/66103/diff/6/
> 
> 
> Testing
> -------
> 
> - I added unit tests.
> - Tested in vagrant and it works as intended.
> - I also built and deployed in our test environment. To measure the performance improvement, I created jobs with nested folders and saw the CPU utilization of the Observer process drop by at least 60% (from 1.5 CPU cores to 0.4 CPU cores).
> 
> Here is one specific test setup: On two hosts I created two tasks. Each task creates identical nested directory structures and files in them. The overall size is 30GB. test_host_1 runs the current version of the Observer, and test_host_2 runs the Observer with this patch and has mesos_disk_collection enabled. The results are as follows:
> 
> ```
> rezam[7]TEST_HOST_1 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
> Thu Mar 22 04:36:17 UTC 2018
> observer.observer_cpu 108.9
> Thu Mar 22 04:36:27 UTC 2018
> observer.observer_cpu 123.2
> Thu Mar 22 04:36:38 UTC 2018
> observer.observer_cpu 123.2
> Thu Mar 22 04:36:48 UTC 2018
> observer.observer_cpu 123.2
> Thu Mar 22 04:36:58 UTC 2018
> observer.observer_cpu 111.0
> Thu Mar 22 04:37:08 UTC 2018
> observer.observer_cpu 111.0
> Thu Mar 22 04:37:18 UTC 2018
> observer.observer_cpu 111.0
> 
> 
> rezam[7]TEST_HOST_2 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
> Thu Mar 22 04:36:20 UTC 2018
> observer.observer_cpu 1.3
> Thu Mar 22 04:36:30 UTC 2018
> observer.observer_cpu 1.3
> Thu Mar 22 04:36:40 UTC 2018
> observer.observer_cpu 1.3
> Thu Mar 22 04:36:50 UTC 2018
> observer.observer_cpu 1.2
> Thu Mar 22 04:37:00 UTC 2018
> observer.observer_cpu 1.2
> Thu Mar 22 04:37:10 UTC 2018
> observer.observer_cpu 1.2
> Thu Mar 22 04:37:20 UTC 2018
> observer.observer_cpu 1.8
> ```
> 
> 
> Thanks,
> 
> Reza Motamedi
> 
>

