mesos-issues mailing list archives

From "Alexander Rojas (JIRA)" <>
Subject [jira] [Commented] (MESOS-6907) FutureTest.After3 is flaky
Date Mon, 16 Jan 2017 14:46:26 GMT


Alexander Rojas commented on MESOS-6907:

I have verified that my theory was correct. Timers are executed in [{{void process::timedout()}}|].
Moreover, {{process::timedout()}} is not executed in any libprocess thread, but in the
libevent loop (see [here|] and [here|]).

As a result, timers are executed in batches, and the timers belonging to a batch are
destroyed only after every timer in that batch has been executed. This is the cause of
the flakiness. It can be solved by forcing a second batch to run (batches run on the same
thread every time): create a second timer and manipulate the {{Clock}} so that the second
timer is scheduled in a different, later batch, then wait for the thunk of that timer to
be executed. I proposed a patch which does just that:

[r/55576/|]: Fixes FutureTest.After3 flakiness.
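
The batching behaviour described above can be illustrated in isolation. The following is only a simplified model, not libprocess code and not the patch itself: {{Timer}}, {{run_batch}}, {{Observation}} and {{simulate}} are hypothetical names, and a {{std::shared_ptr}}/{{std::weak_ptr}} pair stands in for {{Future}}/{{WeakFuture}}. It assumes a batch releases its timers only after all of its thunks have run, and shows that a probe in the same batch still sees the data alive, while a probe in a later batch does not.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for a timer: just a thunk to run on timeout.
struct Timer {
  std::function<void()> thunk;
};

// What the probes saw when their thunks ran.
struct Observation {
  bool alive_in_first_batch = false;
  bool alive_in_second_batch = false;
};

// Runs every thunk of a batch. The timers (and whatever their thunks
// captured) are destroyed only after the last thunk has run.
void run_batch(std::vector<Timer>& batch) {
  for (Timer& timer : batch) {
    timer.thunk();
  }
  batch.clear();
}

Observation simulate() {
  Observation seen;

  // `data` models the state behind the original future, `weak` models
  // the weak future used by the test to observe it.
  auto data = std::make_shared<int>(0);
  std::weak_ptr<int> weak = data;

  std::vector<Timer> first;

  // Models the timer created by `after`: its thunk keeps the data alive.
  first.push_back(Timer{[data]() { ++*data; }});

  // A probe scheduled in the same batch as the timer above.
  first.push_back(Timer{[&seen, weak]() {
    seen.alive_in_first_batch = !weak.expired();
  }});

  // The original handle goes out of scope; only the timer thunk is left
  // holding a reference.
  data.reset();
  assert(weak.use_count() == 1);

  // A probe scheduled in a later batch, as the proposed fix arranges.
  std::vector<Timer> second;
  second.push_back(Timer{[&seen, weak]() {
    seen.alive_in_second_batch = !weak.expired();
  }});

  run_batch(first);
  run_batch(second);
  return seen;
}
```

In this model the check is only reliable once a thunk from a later batch has run, which is why waiting for the second timer removes the race instead of sleeping for an arbitrary amount of time.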

> FutureTest.After3 is flaky
> --------------------------
>                 Key: MESOS-6907
>                 URL:
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>            Reporter: Alexander Rojas
> There is apparently a race condition between the time an instance of {{Future<T>}} goes out of scope and when the enclosing data is actually deleted, if {{Future<T>::after(Duration, lambda::function<Future<T>(const Future<T>&)>)}} is called.
> The issue is more likely to occur if the machine is under load or if it is not a very powerful one. The easiest way to reproduce it is to run:
> {code}
> $ stress -c 4 -t 2600 -d 2 -i 2 &
> $ ./libprocess-tests --gtest_filter="FutureTest.After3" --gtest_repeat=-1 --gtest_break_on_failure
> {code}
> An exploratory fix for the issue is to change the test to:
> {code}
> TEST(FutureTest, After3)
> {
>   Future<Nothing> future;
>   process::WeakFuture<Nothing> weak_future(future);
>   EXPECT_SOME(weak_future.get());
>   {
>     Clock::pause();
>     // The original future disappears here. After this call the
>     // original future goes out of scope and should not be reachable
>     // anymore.
>     future = future
>       .after(Milliseconds(1), [](Future<Nothing> f) {
>         f.discard();
>         return Nothing();
>       });
>     Clock::advance(Seconds(2));
>     Clock::settle();
>     AWAIT_READY(future);
>   }
>   if (weak_future.get().isSome()) {
>     os::sleep(Seconds(1));
>   }
>   EXPECT_NONE(weak_future.get());
>   EXPECT_FALSE(future.hasDiscard());
> }
> {code}
> The interesting thing about the fix is that both extra snippets are needed (neither one alone is enough) to prevent the issue from happening.
