mesos-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cody Maloney" <c...@mesosphere.io>
Subject Re: Review Request 30587: Additional logging in libprocess
Date Fri, 27 Feb 2015 21:20:40 GMT


> On Feb. 17, 2015, 5:58 a.m., Ben Mahler wrote:
> > 3rdparty/libprocess/src/process.cpp, line 2118
> > <https://reviews.apache.org/r/30587/diff/1/?file=846771#file846771line2118>
> >
> >     Hm.. it looks like `make check` with `GLOG_v=1` generates a ton of noise from
these messages, is this expected?
> >     
> >     If this is to be expected, do we want to bump up the level?

The metrics ones seem to come from overall libprocess shutdown / after completion of tests.
It looks like we shut down metrics __before__ we stop sending messages to it. (Probably not
overly harmful, but not strictly correct). That some code is running after all the tests have
"completed" is worrying to me... (Relevant log chunks below)

The waiter ones are generated in some of the libprocess tests. They don't show up in the test
which actually is causing the code to fire (again worrying...). There is a WaitWaiter process
spawned when you do a process::wait() with a duration. That WaitWaiter has its own timeout
function set with a delay. Since the wait completes before the timeout, the wait() returns,
the WaitWaiter destructs, and doesn't clean up the delay it set. The delay then fires eventually
once the clock has been advanced enough by other tests.

Both of these I believe are programmer errors of not cleaning up properly.

We could fix it by just fixing the cases (Sample patch for the WaitWaiter):
```
diff --git a/3rdparty/libprocess/src/process.cpp b/3rdparty/libprocess/src/process.cpp
index d7fcba90bbe990236a45a76b8f39285a5097861c..f60f21c7b197b30f812fb923616be8693ee5365c 100644
--- a/3rdparty/libprocess/src/process.cpp
+++ b/3rdparty/libprocess/src/process.cpp
@@ -2915,7 +2915,7 @@ public:
   {
     VLOG(3) << "Running waiter process for " << pid;
     link(pid);
-    delay(duration, self(), &WaitWaiter::timeout);
+    timer = delay(duration, self(), &WaitWaiter::timeout);
   }
 
 private:
@@ -2923,6 +2923,7 @@ private:
   {
     VLOG(3) << "Waiter process waited for " << pid;
     *waited = true;
+    Clock::cancel(timer.get());
     terminate(self());
   }
 
@@ -2937,6 +2938,7 @@ private:
   const UPID pid;
   const Duration duration;
   bool* const waited;
+  Option<Timer> timer;
 };

```

What would be better though I think is making it so when a Process / ProcessBase is destroyed
/ de-registers we remove all its outstanding clocks. This prevents future programmer mistakes,
as well as eliminates a possible cause of subtle bugs inside mesos processes: Spawn a process,
set a delay message for it, kill that process, start a new one with the same name. The new
process will get the message once the timeout is reached. Reusing process names happens fairly
frequently in the tests.

The relevant log bits (Generated with GLOG_v=3):

**Metrics log chunk:**
```
[==========] 1 test from 1 test case ran. (103 ms total)
[  PASSED  ] 1 test.

  YOU HAVE 3 DISABLED TESTS

I0227 12:29:24.962259  6723 clock.cpp:295] Clock paused at 0xeb0400
I0227 12:29:24.963186  6732 clock.cpp:135] Handling timers up to 2015-02-27 20:29:30.962255872+00:00
I0227 12:29:24.963199  6732 clock.cpp:142] Have timeout(s) at 2015-02-27 20:29:30.959864832+00:00
I0227 12:29:24.963220  6732 clock.cpp:180] Clock has settled
I0227 12:29:24.963242  6725 process.cpp:2161] Resuming reaper(1)@127.0.0.1:53591 at 2015-02-27
20:29:30.962255872+00:00
I0227 12:29:24.963263  6725 clock.cpp:243] Created a timer for reaper(1)@127.0.0.1:53591 in
100ms in the future (2015-02-27 20:29:31.062255872+00:00)
I0227 12:29:24.972337  6723 clock.cpp:429] Clock is settled
I0227 12:29:24.972363  6723 process.cpp:2466] Donating thread to __gc__@127.0.0.1:53591 while
waiting
I0227 12:29:24.972375  6723 process.cpp:2161] Resuming __gc__@127.0.0.1:53591 at 2015-02-27
20:29:30.962255872+00:00e
I0227 12:29:24.972388  6723 process.cpp:2268] Cleaning up __gc__@127.0.0.1:53591
I0227 12:29:24.972437  6723 process.cpp:2466] Donating thread to __limiter__(1)@127.0.0.1:53591
while waiting
I0227 12:29:24.972446  6723 process.cpp:2161] Resuming __limiter__(1)@127.0.0.1:53591 at 2015-02-27
20:29:30.962255872+00:00
I0227 12:29:24.972457  6723 process.cpp:2268] Cleaning up __limiter__(1)@127.0.0.1:53591
I0227 12:29:24.972487  6726 process.cpp:2161] Resuming __processes__@127.0.0.1:53591 at 2015-02-27
20:29:30.962255872+00:00
I0227 12:29:24.972506  6726 process.cpp:2268] Cleaning up __processes__@127.0.0.1:53591
I0227 12:29:24.972527  6723 process.cpp:2466] Donating thread to help@127.0.0.1:53591 while
waiting
I0227 12:29:24.972535  6723 process.cpp:2161] Resuming help@127.0.0.1:53591 at 2015-02-27
20:29:30.962255872+00:00
I0227 12:29:24.972545  6723 process.cpp:2268] Cleaning up help@127.0.0.1:53591
I0227 12:29:24.972571  6731 process.cpp:2161] Resuming logging@127.0.0.1:53591 at 2015-02-27
20:29:30.962255872+00:00
I0227 12:29:24.972590  6731 process.cpp:2268] Cleaning up logging@127.0.0.1:53591
I0227 12:29:24.972611  6723 process.cpp:2466] Donating thread to metrics@127.0.0.1:53591 while
waiting
I0227 12:29:24.972620  6723 process.cpp:2161] Resuming metrics@127.0.0.1:53591 at 2015-02-27
20:29:30.962255872+00:00
I0227 12:29:24.972630  6723 process.cpp:2268] Cleaning up metrics@127.0.0.1:53591
I0227 12:29:24.972651  6726 process.cpp:2161] Resuming profiler@127.0.0.1:53591 at 2015-02-27
20:29:30.962255872+00:00
I0227 12:29:24.972662  6726 process.cpp:2268] Cleaning up profiler@127.0.0.1:53591
I0227 12:29:24.972677  6729 process.cpp:2161] Resuming reaper(1)@127.0.0.1:53591 at 2015-02-27
20:29:30.962255872+00:00
I0227 12:29:24.972688  6729 process.cpp:2268] Cleaning up reaper(1)@127.0.0.1:53591
I0227 12:29:24.972705  6723 process.cpp:2466] Donating thread to system@127.0.0.1:53591 while
waiting
I0227 12:29:24.972714  6723 process.cpp:2161] Resuming system@127.0.0.1:53591 at 2015-02-27
20:29:30.962255872+00:00
I0227 12:29:24.972730  6723 process.cpp:2118] Dropped / Lost event for PID: metrics@127.0.0.1:53591
I0227 12:29:24.972743  6723 process.cpp:2118] Dropped / Lost event for PID: metrics@127.0.0.1:53591
I0227 12:29:24.972755  6723 process.cpp:2118] Dropped / Lost event for PID: metrics@127.0.0.1:53591
I0227 12:29:24.972767  6723 process.cpp:2118] Dropped / Lost event for PID: metrics@127.0.0.1:53591
I0227 12:29:24.972779  6723 process.cpp:2118] Dropped / Lost event for PID: metrics@127.0.0.1:53591
I0227 12:29:24.972790  6723 process.cpp:2118] Dropped / Lost event for PID: metrics@127.0.0.1:53591
I0227 12:29:24.972800  6723 process.cpp:2268] Cleaning up system@127.0.0.1:53591
make[5]: Leaving directory '/home/cody/projects/mesos/build/3rdparty/libprocess'
```

**__waiter__ log chunks:**
```
[----------] 7 tests from HTTP
[ RUN      ] HTTP.auth
I0227 12:40:42.212824  8256 process.cpp:2161] Resuming help@127.0.0.1:44132 at 2015-02-27
20:40:42.212829184+00:00
I0227 12:40:42.212836  8257 process.cpp:2161] Resuming (1)@127.0.0.1:44132 at 2015-02-27 20:40:42.212840192+00:00
I0227 12:40:42.212839  8253 process.cpp:2151] Spawned process (1)@127.0.0.1:44132
I0227 12:40:42.212970  8253 process.cpp:2151] Spawned process __latch__(1)@127.0.0.1:44132
I0227 12:40:42.212975  8261 process.cpp:2161] Resuming __gc__@127.0.0.1:44132 at 2015-02-27
20:40:42.212979200+00:00
I0227 12:40:42.212995  8253 process.cpp:2151] Spawned process __waiter__(1)@127.0.0.1:44132
I0227 12:40:42.212976  8260 process.cpp:2161] Resuming __latch__(1)@127.0.0.1:44132 at 2015-02-27
20:40:42.212996096+00:00
I0227 12:40:42.212996  8256 process.cpp:2161] Resuming __waiter__(1)@127.0.0.1:44132 at 2015-02-27
20:40:42.212999168+00:00
I0227 12:40:42.213027  8256 process.cpp:2916] Running waiter process for __latch__(1)@127.0.0.1:44132
I0227 12:40:42.213052  8256 clock.cpp:243] Created a timer for __waiter__(1)@127.0.0.1:44132
in 15secs in the future (2015-02-27 20:40:57.213051136+00:00)
I0227 12:40:42.213198  8254 process.cpp:2161] Resuming (1)@127.0.0.1:44132 at 2015-02-27 20:40:42.213201920+00:00
I0227 12:40:42.213217  8254 process.cpp:2771] Handling HTTP event for process '(1)' with path:
'/(1)/auth'
I0227 12:40:42.213254  8254 process.cpp:2151] Spawned process __http__(1)@127.0.0.1:44132
I0227 12:40:42.213258  8258 process.cpp:2161] Resuming __gc__@127.0.0.1:44132 at 2015-02-27
20:40:42.213270784+00:00
I0227 12:40:42.213290  8259 process.cpp:2161] Resuming __http__(1)@127.0.0.1:44132 at 2015-02-27
20:40:42.213294080+00:00
I0227 12:40:42.213408  8257 process.cpp:2161] Resuming __http__(1)@127.0.0.1:44132 at 2015-02-27
20:40:42.213412096+00:00
I0227 12:40:42.213420  8257 process.cpp:2268] Cleaning up __http__(1)@127.0.0.1:44132
I0227 12:40:42.213443  8255 process.cpp:2161] Resuming __gc__@127.0.0.1:44132 at 2015-02-27
20:40:42.213448960+00:00
I0227 12:40:42.213508  8257 process.cpp:2161] Resuming __latch__(1)@127.0.0.1:44132 at 2015-02-27
20:40:42.213510912+00:00
I0227 12:40:42.213518  8257 process.cpp:2268] Cleaning up __latch__(1)@127.0.0.1:44132
I0227 12:40:42.213531  8258 process.cpp:2161] Resuming __waiter__(1)@127.0.0.1:44132 at 2015-02-27
20:40:42.213534976+00:00
I0227 12:40:42.213533  8261 process.cpp:2161] Resuming __gc__@127.0.0.1:44132 at 2015-02-27
20:40:42.213537024+00:00
I0227 12:40:42.213546  8258 process.cpp:2924] Waiter process waited for __latch__(1)@127.0.0.1:44132
I0227 12:40:42.213553  8258 process.cpp:2268] Cleaning up __waiter__(1)@127.0.0.1:44132

...
[       OK ] HTTP.auth (2 ms)

...

[ RUN      ] Metrics.SnapshotTimeout

...
I0227 12:40:42.484825  8253 clock.cpp:429] Clock is settled
I0227 12:40:42.484838  8253 clock.cpp:345] Clock advanced (1secs) to 0x2599420
I0227 12:40:42.484966  8262 clock.cpp:135] Handling timers up to 2015-02-27 20:40:57.446742048+00:00
I0227 12:40:42.484983  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.213051136+00:00
I0227 12:40:42.484994  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.213692928+00:00
I0227 12:40:42.485003  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.214238976+00:00
I0227 12:40:42.485012  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.214799872+00:00
I0227 12:40:42.485019  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.214999808+00:00
I0227 12:40:42.485028  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.215301888+00:00
I0227 12:40:42.485035  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.215611904+00:00
I0227 12:40:42.485046  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.216196096+00:00
I0227 12:40:42.485055  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.216718080+00:00
I0227 12:40:42.485062  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.217353984+00:00
I0227 12:40:42.485070  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.217850880+00:00
I0227 12:40:42.485077  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.219537920+00:00
I0227 12:40:42.485088  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.221196800+00:00
I0227 12:40:42.485096  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.228478976+00:00
I0227 12:40:42.485103  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.334412064+00:00
I0227 12:40:42.485110  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.334772000+00:00
I0227 12:40:42.485118  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.334937120+00:00
I0227 12:40:42.485126  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.335040032+00:00
I0227 12:40:42.485133  8262 clock.cpp:142] Have timeout(s) at 2015-02-27 20:40:57.335128864+00:00
I0227 12:40:42.485159  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(1)@127.0.0.1:44132
I0227 12:40:42.485170  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(2)@127.0.0.1:44132
I0227 12:40:42.485179  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(3)@127.0.0.1:44132
I0227 12:40:42.485188  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(4)@127.0.0.1:44132
I0227 12:40:42.485198  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(5)@127.0.0.1:44132
I0227 12:40:42.485205  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(6)@127.0.0.1:44132
I0227 12:40:42.485214  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(7)@127.0.0.1:44132
I0227 12:40:42.485224  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(8)@127.0.0.1:44132
I0227 12:40:42.485231  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(9)@127.0.0.1:44132
I0227 12:40:42.485241  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(10)@127.0.0.1:44132
I0227 12:40:42.485250  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(11)@127.0.0.1:44132
I0227 12:40:42.485260  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(12)@127.0.0.1:44132
I0227 12:40:42.485268  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(13)@127.0.0.1:44132
I0227 12:40:42.485278  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(14)@127.0.0.1:44132
I0227 12:40:42.485286  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(15)@127.0.0.1:44132
I0227 12:40:42.485297  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(16)@127.0.0.1:44132
I0227 12:40:42.485306  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(17)@127.0.0.1:44132
I0227 12:40:42.485314  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(18)@127.0.0.1:44132
I0227 12:40:42.485324  8262 process.cpp:2118] Dropped / Lost event for PID: __waiter__(19)@127.0.0.1:44132
I0227 12:40:42.485333  8262 clock.cpp:180] Clock has settled
```


- Cody


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30587/#review72676
-----------------------------------------------------------


On Feb. 4, 2015, 12:06 a.m., Cody Maloney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30587/
> -----------------------------------------------------------
> 
> (Updated Feb. 4, 2015, 12:06 a.m.)
> 
> 
> Review request for mesos and Benjamin Hindman.
> 
> 
> Bugs: MESOS-1806
>     https://issues.apache.org/jira/browse/MESOS-1806
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Log when we drop a message since we can't find the local recipient.
> Include details of what was lost in a send failure.
> 
> 
> Diffs
> -----
> 
>   3rdparty/libprocess/src/process.cpp 67b6b3b9c13d95fa1a24b48a12c5c831c7f249bf 
> 
> Diff: https://reviews.apache.org/r/30587/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Cody Maloney
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message