Date: Fri, 13 Jan 2017 15:20:26 +0000 (UTC)
From: "Alexander Rojas (JIRA)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Updated] (MESOS-6907) FutureTest.After3 is flaky

     [ https://issues.apache.org/jira/browse/MESOS-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Rojas updated MESOS-6907:
-----------------------------------
    Description:
There is apparently a race condition between the time an instance of {{Future}} goes
out of scope and when the enclosing data is actually deleted, if {{Future<T>::after(Duration, lambda::function<Future<T>(const Future<T>&)>)}} is called.

The issue is more likely to occur if the machine is under load or is not very powerful. The easiest way to reproduce it is to run:

{code}
$ stress -c 4 -t 2600 -d 2 -i 2 &
$ ./libprocess-tests --gtest_filter="FutureTest.After3" --gtest_repeat=-1 --gtest_break_on_failure
{code}

An exploratory fix for the issue is to change the test to:

{code}
TEST(FutureTest, After3)
{
  Future<Nothing> future;
  process::WeakFuture<Nothing> weak_future(future);

  EXPECT_SOME(weak_future.get());

  {
    Clock::pause();

    // The original future disappears here. After this call the
    // original future goes out of scope and should not be reachable
    // anymore.
    future = future
      .after(Milliseconds(1), [](Future<Nothing> f) {
        f.discard();
        return Nothing();
      });

    Clock::advance(Seconds(2));
    Clock::settle();

    AWAIT_READY(future);
  }

  // Exploratory: wait briefly if the original data is still reachable.
  if (weak_future.get().isSome()) {
    os::sleep(Seconds(1));
  }

  EXPECT_NONE(weak_future.get());
  EXPECT_FALSE(future.hasDiscard());
}
{code}

The interesting thing about the fix is that both extra snippets are needed (either one alone is not enough) to prevent the issue from happening.

  was:
After playing with the latest patch for MESOS-6484, we found that its modifications introduce flakiness in the test {{FutureTest.After3}}. Depending on the machine and its load, the failure occurs between once every 10000 runs and once every 500000 runs, and is most likely caused by a race condition in the code.
To reproduce, run:

{code}
${MESOS_BUILD_DIR}/3rdparty/libprocess/libprocess-tests --gtest_filter="*.After3" --gtest_repeat=-1 --gtest_break_on_failure
{code}

> FutureTest.After3 is flaky
> --------------------------
>
>                 Key: MESOS-6907
>                 URL: https://issues.apache.org/jira/browse/MESOS-6907
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>            Reporter: Alexander Rojas
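
The sleep guard in the exploratory fix amounts to giving a late holder of the shared data some time to let go before asserting that the data is gone. Below is a minimal sketch of that idea in standard C++, assuming {{std::shared_ptr}} / {{std::weak_ptr}} as stand-ins for libprocess's {{Future}} / {{WeakFuture}}. The names {{SharedState}}, {{expiresWithin}} and {{simulateDelayedRelease}} are hypothetical and exist only for illustration; the sketch does not reproduce libprocess internals and makes no claim about the actual root cause.

```cpp
#include <cassert>
#include <chrono>
#include <memory>
#include <thread>

// Hypothetical stand-in for the data shared by all copies of a future.
struct SharedState {};

// Polls `weak` until it expires or `timeout` elapses. Returns true if
// the referenced object was destroyed within the timeout.
template <typename T>
bool expiresWithin(
    const std::weak_ptr<T>& weak,
    std::chrono::milliseconds timeout)
{
  assert(timeout.count() >= 0);

  const auto deadline = std::chrono::steady_clock::now() + timeout;

  while (!weak.expired()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }

  return true;
}

// Simulates a delayed release (an assumption, not libprocess behavior):
// a helper thread keeps its own reference for `holdFor` after the
// caller has dropped its reference. Returns whether the object was
// destroyed within `timeout`.
bool simulateDelayedRelease(
    std::chrono::milliseconds holdFor,
    std::chrono::milliseconds timeout)
{
  auto state = std::make_shared<SharedState>();
  std::weak_ptr<SharedState> weak = state;

  // The lambda captures a copy of `state`, so the object stays alive
  // until the helper resets that copy.
  std::thread helper([state, holdFor]() mutable {
    std::this_thread::sleep_for(holdFor);
    state.reset();
  });

  // The caller's reference goes away here, like the original future
  // going out of scope in the test.
  state.reset();

  const bool released = expiresWithin(weak, timeout);

  helper.join();
  return released;
}
```

Calling {{simulateDelayedRelease}} with a short hold time and a generous timeout should report that the object was released, whereas a long hold time combined with a short timeout should not. A bounded poll of this kind returns as soon as the reference is gone, so it could replace a fixed one-second sleep without slowing down the common case.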