From dev-return-2794-archive-asf-public=cust-asf.ponee.io@mxnet.incubator.apache.org Sat May 5 00:05:31 2018
Mailing-List: contact dev-help@mxnet.incubator.apache.org; run by ezmlm
Reply-To: dev@mxnet.incubator.apache.org
MIME-Version: 1.0
From: Anirudh
Date: Fri, 4 May 2018 15:05:22 -0700
Subject: Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2
To: dev@mxnet.incubator.apache.org
Content-Type: multipart/alternative; boundary="00000000000041bd9c056b68843d"

--00000000000041bd9c056b68843d
Content-Type: text/plain; charset="UTF-8"

Hi Pedro, Haibin, Indhu,

Thank you for your inputs on the release.

I ran the test `test_module.py:test_forward_reshape` 250k times with
different seeds. I was unable to reproduce the issue on the release branch.
If everything goes well with the CI tests that Pedro is running until Sunday,
I think we should move forward with the release (given that we have enough
+1s).

Is it possible to trigger the CI on the 1.2 branch repeatedly or on a fixed
schedule until Sunday?

Anirudh

On Fri, May 4, 2018 at 11:56 AM, Indhu wrote:

> +1
>
> I've been using a CUDA build from this branch (built from source) on Ubuntu
> for a couple of days now and I haven't seen any issue.
>
> The flaky tests need to be fixed, but this release need not be blocked for
> that.
>
>
> On Fri, May 4, 2018 at 11:32 AM, Haibin Lin wrote:
>
> > I agree with Anirudh that the focus of the discussion should be limited
> > to the release branch, not the master branch. Anything that breaks on
> > master but works on the release branch should not block the release
> > itself.
> >
> > Best,
> >
> > Haibin
> >
> > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> >
> > > I see your point.
> > >
> > > I checked the failures on the v1.2.0 branch and I don't see segfaults,
> > > just minor failures due to flaky tests.
> > >
> > > I will trigger it repeatedly a few times until Sunday and change my
> > > vote accordingly.
> > >
> > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/17/pipeline
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/15/pipeline/
> > >
> > >
> > > Pedro.
> > >
> > > On Fri, May 4, 2018 at 7:16 PM, Anirudh wrote:
> > >
> > > > Hi Pedro,
> > > >
> > > > Thank you for the suggestions. I will try to reproduce this without
> > > > fixed seeds and also run it for a longer duration.
> > > > Having said that, running unit tests over and over for a couple of
> > > > days will likely cause problems because there are around 42 open
> > > > issues for flaky tests:
> > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> > > > Also, the release branch diverged from master around 3 weeks ago
> > > > and it doesn't have many of the changes merged to master.
> > > > So my question essentially is: what will be your benchmark to accept
> > > > the release?
> > > > Is it that we run the test which you provided on 1.2 without fixed
> > > > seeds and for a longer duration without failures?
> > > > Or is it that all unit tests should pass over a period of 2 days
> > > > without issues? This may require fixing all of the flaky tests, which
> > > > would delay the release by a considerable amount of time.
> > > > Or is it something else?
> > > >
> > > > Anirudh
> > > >
> > > >
> > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >
> > > > > Could you remove the fixed seeds and run it for a couple of hours
> > > > > with an additional loop? Also, I would suggest running the unit
> > > > > tests over and over for a couple of days if possible.
> > > > >
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh wrote:
> > > > >
> > > > > > Hi Pedro and Naveen,
> > > > > >
> > > > > > I am able to reproduce this issue with MKLDNN on master but
> > > > > > not on the 1.2.RC2 branch.
> > > > > >
> > > > > > Did the following on the 1.2.RC2 branch:
> > > > > >
> > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > > export MXNET_TEST_SEED=11
> > > > > > export MXNET_MODULE_SEED=812478194
> > > > > > export MXNET_TEST_COUNT=10000
> > > > > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> > > > > >
> > > > > > Was able to do the 10k runs successfully.
> > > > > >
> > > > > > Anirudh
> > > > > >
> > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh wrote:
> > > > > >
> > > > > > > Hi Pedro and Naveen,
> > > > > > >
> > > > > > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > > > > > Also, there are a bunch of MKLDNN fixes that didn't go into the
> > > > > > > release branch. Is this issue reproducible on the release branch?
> > > > > > > In my opinion, since we have marked MKLDNN as an experimental
> > > > > > > feature for the release, if it is confirmed to be an MKLDNN issue
> > > > > > > we don't need to block the release on it.
> > > > > > >
> > > > > > > Anirudh
> > > > > > >
> > > > > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <mnnaveen@gmail.com> wrote:
> > > > > > >
> > > > > > >> Thanks for raising this issue Pedro.
> > > > > > >>
> > > > > > >> -1(binding)
> > > > > > >>
> > > > > > >> We were in a similar state for a while a year ago; a lot of
> > > > > > >> effort went into stabilizing the tests and the CI. I have seen
> > > > > > >> that the PR builds are non-deterministic and you have to retry
> > > > > > >> over and over (wasting resources and time) and hope you get
> > > > > > >> lucky.
> > > > > > >>
> > > > > > >> Look at the dashboard for the master build:
> > > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> > > > > > >>
> > > > > > >> -Naveen
> > > > > > >>
> > > > > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > > > > >>
> > > > > > >> > -1 nondeterministic failures on CI master:
> > > > > > >> > https://issues.apache.org/jira/browse/MXNET-396
> > > > > > >> >
> > > > > > >> > Was able to reproduce once on a fresh p3 instance with DLAMI;
> > > > > > >> > can't reproduce consistently.
> > > > > > >> >
> > > > > > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh <anirudh2290@gmail.com> wrote:
> > > > > > >> >
> > > > > > >> > > Hi all,
> > > > > > >> > >
> > > > > > >> > > As part of the RC2 release, we have addressed bugs and some
> > > > > > >> > > concerns that were raised.
> > > > > > >> > >
> > > > > > >> > > I would like to propose a vote to release Apache MXNet
> > > > > > >> > > (incubating) version 1.2.0.RC2. Voting will start now
> > > > > > >> > > (Wednesday, May 2nd) and end at 12:50 PM PDT, Sunday, May 6th.
> > > > > > >> > >
> > > > > > >> > > Link to release notes:
> > > > > > >> > > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > > > > > >> > >
> > > > > > >> > > Link to release candidate 1.2.0.rc2:
> > > > > > >> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > > > > > >> > >
> > > > > > >> > > Voting results for 1.2.0.rc2:
> > > > > > >> > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > > > > > >> > >
> > > > > > >> > > View this page, click on "Build from Source", and use the
> > > > > > >> > > source code obtained from 1.2.0.rc2 tag:
> > > > > > >> > > https://mxnet.incubator.apache.org/install/index.html
> > > > > > >> > >
> > > > > > >> > > (Note: The README.md points to the 1.2.0 tag and does not
> > > > > > >> > > work at the moment.)
> > > > > > >> > >
> > > > > > >> > > Please remember to test first before voting accordingly:
> > > > > > >> > >
> > > > > > >> > > +1 = approve
> > > > > > >> > > +0 = no opinion
> > > > > > >> > > -1 = disapprove (provide reason)
> > > > > > >> > >
> > > > > > >> > > Anirudh

--00000000000041bd9c056b68843d--