mesos-issues mailing list archives

From "Alexander Rojas (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-6907) FutureTest.After3 is flaky
Date Fri, 13 Jan 2017 15:20:26 GMT

     [ https://issues.apache.org/jira/browse/MESOS-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Rojas updated MESOS-6907:
-----------------------------------
    Description: 
There appears to be a race between the moment an instance of {{Future<T>}} goes
out of scope and the moment its underlying data is actually deleted, when {{Future<T>::after(Duration,
lambda::function<Future<T>(const Future<T>&)>)}} has been called.
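
The suspected mechanism can be illustrated with the standard library alone. The sketch below is not libprocess code. It assumes that {{Future<T>}} shares its state through reference counting in the way {{std::shared_ptr}} does, and that a pending {{after()}} callback keeps its own reference; neither assumption is confirmed in this report. Under those assumptions, a weak reference stays resolvable until the last holder lets go, regardless of when the visible handle went out of scope. The names {{Data}}, {{Observation}} and {{observe}} are hypothetical.

```cpp
#include <functional>
#include <memory>

// Hypothetical stand-in for the state shared by copies of a future.
struct Data {};

struct Observation
{
  bool aliveWhileCallbackPending;
  bool aliveAfterCallbackReleased;
};

Observation observe()
{
  std::weak_ptr<Data> weak;
  std::function<void()> pending; // Stands in for a pending timer callback.

  {
    std::shared_ptr<Data> owner = std::make_shared<Data>();
    weak = owner;

    // The callback captures its own strong reference to the data.
    pending = [owner]() { (void) owner; };
  } // `owner` goes out of scope here, but `pending` still holds a reference.

  Observation result;
  result.aliveWhileCallbackPending = !weak.expired();

  // The data is deleted only once the callback itself is destroyed.
  pending = nullptr;
  result.aliveAfterCallbackReleased = !weak.expired();

  return result;
}
```

In this single-threaded sketch the order is fixed: the data is still alive while the callback is pending and gone once it is released. If the callback were instead released by another thread, the second check could run before or after the release, which is the kind of timing dependence described above.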

The issue is more likely to occur when the machine is under load or underpowered.
The easiest way to reproduce it is to stress the machine and run the test in a loop:

{code}
$ stress -c 4 -t 2600 -d 2 -i 2 &
$ ./libprocess-tests --gtest_filter="FutureTest.After3" --gtest_repeat=-1 --gtest_break_on_failure
{code}

An exploratory fix for the issue is to change the test to:

{code}
TEST(FutureTest, After3)
{
  Future<Nothing> future;
  process::WeakFuture<Nothing> weak_future(future);

  EXPECT_SOME(weak_future.get());

  {
    Clock::pause();
    // The original future disappears here. After this call the
    // original future goes out of scope and should not be reachable
    // anymore.
    future = future
      .after(Milliseconds(1), [](Future<Nothing> f) {
        f.discard();
        return Nothing();
      });

    Clock::advance(Seconds(2));
    Clock::settle();

    AWAIT_READY(future);
  }

  if (weak_future.get().isSome()) {
    os::sleep(Seconds(1));
  }

  EXPECT_NONE(weak_future.get());
  EXPECT_FALSE(future.hasDiscard());
}
{code}

The interesting thing about the fix is that both extra snippets are needed (either one
alone is not enough) to prevent the issue from happening.
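
The single {{os::sleep(Seconds(1))}} in the test above waits once and hopes that one second is enough. For experimentation, a bounded poll makes the wait explicit and returns as soon as the condition holds. The following is a minimal sketch using only the standard library; {{waitUntil}} is a hypothetical helper, not part of libprocess or stout, and it is not claimed to fix the underlying race.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: polls `condition` until it returns true or
// `timeout` elapses. Returns the final value of the condition.
bool waitUntil(
    const std::function<bool()>& condition,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval = std::chrono::milliseconds(10))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;

  while (!condition()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      // Check one last time so a late change is not missed.
      return condition();
    }
    std::this_thread::sleep_for(interval);
  }

  return true;
}
```

In the test, the condition would be something like {{weak_future.get().isNone()}}, with the result checked in place of the unconditional sleep. Note that this waits on real time, as {{os::sleep}} does, so it is unaffected by {{Clock::pause()}}.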


  was:
While experimenting with the latest patch for MESOS-6484, we found that its modifications
introduce flakiness in the test {{FutureTest.After3}}. Depending on the machine and its
load, the failure occurs between once every 10000 runs and once every 500000 runs, so the
cause is most likely a race condition in the code.

To reproduce run:

{code}
${MESOS_BUILD_DIR}/3rdparty/libprocess/libprocess-tests --gtest_filter="*.After3" --gtest_repeat=-1 --gtest_break_on_failure
{code}


> FutureTest.After3 is flaky
> --------------------------
>
>                 Key: MESOS-6907
>                 URL: https://issues.apache.org/jira/browse/MESOS-6907
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>            Reporter: Alexander Rojas
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
