From: Marco de Abreu
Date: Fri, 30 Nov 2018 02:17:51 +0100
Subject: Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release
To: dev@mxnet.incubator.apache.org

Hi Naveen,

yeah sorry, that's DockerHub acting up again (this happens every now and then, unfortunately). Basically, docker pull starts multiple download threads, and it seems that sometimes a single web server request sits in the queue forever, which then slows down the docker pull (for the cache retrieval).

Chance will be assisting with CI issues this week, and I explained my proposed solution to him: basically, wrap the 'docker pull' in a timeout combined with a retry with backoff. Anton proposed that, if the retry still fails after a few attempts, we fall back to the local cache and cache regeneration to avoid the job failing. That would solve the problem you're encountering. We would basically wrap [1] in the timeout-retry mechanism.

Best regards,
Marco

[1]: https://github.com/apache/incubator-mxnet/blob/master/ci/docker_cache.py#L107

On Fri, Nov 30, 2018 at 2:01 AM Joshua Z. Zhang wrote:

> Hi, I would like to bring a critical performance and stability patch of
> existing gluon dataloader to 1.4.0:
> https://github.com/apache/incubator-mxnet/pull/13447
>
> This PR is finished, waiting for CI to pass.
>
> Steffen, could you help me add that to the tracked list?
>
> Best,
> Zhi
>
> > On Nov 29, 2018, at 4:25 PM, Naveen Swamy wrote:
> >
> > the tests are randomly failing in different stages
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13105/
> > This PR has failed 8 times so far
> >
> > On Thu, Nov 29, 2018 at 3:43 PM Steffen Rochel wrote:
> >
> >> Pedro - ok. Please add PR to v1.4.x branch after merge to master and
> >> please update tracking page
> >> <https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack>.
> >> Steffen
> >>
> >> On Thu, Nov 29, 2018 at 3:00 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> >>
> >>> PR is ready from my side and passes the tests, unless somebody raises
> >>> any concerns it's good to go.
> >>>
> >>> On Thu, Nov 29, 2018 at 9:50 PM Steffen Rochel <steffenrochel@gmail.com> wrote:
> >>>>
> >>>> Pedro - added to 1.4.0 tracking list
> >>>> <https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack>
> >>>>
> >>>> Do you have already ETA?
> >>>> Steffen
> >>>>
> >>>> On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> >>>>
> >>>>> Hi all.
> >>>>>
> >>>>> There are two important issues / fixes that should go in the next
> >>>>> release in my radar:
> >>>>>
> >>>>> 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> >>>>> There is a bug in shape inference on CPU when not using MKL, also we
> >>>>> are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> >>>>> I'm finishing a fix for these issues in the above PR.
> >>>>>
> >>>>> 2) https://github.com/apache/incubator-mxnet/issues/13438
> >>>>> We are seeing crashes due to unsafe setenv in multithreaded code.
> >>>>> Setenv / getenv from multiple threads is not safe and is causing
> >>>>> segfaults. This piece of code (the handlers in pthread_atfork) already
> >>>>> caused a very difficult to diagnose hang in a previous release, where
> >>>>> a fork inside cudnn would deadlock the engine.
> >>>>>
> >>>>> I would remove setenv from 2) as a mitigation, but we would need to
> >>>>> check for regressions as we could be creating additional threads
> >>>>> inside the engine.
> >>>>>
> >>>>> I would suggest that we address these two major issues before the
> >>>>> next release.
> >>>>>
> >>>>> Pedro
> >>>>>
> >>>>> On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <steffenrochel@gmail.com> wrote:
> >>>>>>
> >>>>>> Dear MXNet community,
> >>>>>>
> >>>>>> I will be the release manager for the upcoming Apache MXNet 1.4.0
> >>>>>> release.
> >>>>>> Sergey Kolychev will be co-managing the release and providing help
> >>>>>> from the committers side.
> >>>>>> A release candidate will be cut on November 29, 2018 and voting
> >>>>>> will start December 7, 2018. Release notes have been drafted here [1].
> >>>>>> If you have any additional features in progress and would like to
> >>>>>> include it in this release, please assure they have been merged by
> >>>>>> November 27, 2018. Release schedule is available here [2].
> >>>>>>
> >>>>>> Feel free to add any other comments/suggestions. Please help to
> >>>>>> review and merge outstanding PR's and resolve issues impacting the
> >>>>>> quality of the 1.4.0 release.
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Steffen
> >>>>>>
> >>>>>> [1] https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> >>>>>>
> >>>>>> [2] https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> >>>>>>
> >>>>>> On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <kellen.sunderland@gmail.com> wrote:
> >>>>>>
> >>>>>>> Spoke too soon[1], looks like others have been adding Turing
> >>>>>>> support as well (thanks to those helping with this). I believe
> >>>>>>> there's still a few changes we'd have to make to claim support
> >>>>>>> though (mshadow CMake changes, PyPi package creation tweaks).
> >>>>>>>
> >>>>>>> 1: https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> >>>>>>>
> >>>>>>> On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <kellen.sunderland@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> >>>>>>>> https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> >>>>>>>> regression in master which causes incorrect feature vectors to be
> >>>>>>>> output when using the TensorRT feature. (Thanks to Nathalie for
> >>>>>>>> helping me track down the root cause of the issue). I'm currently
> >>>>>>>> blocked on a CI issue I haven't seen before, but hope to have it
> >>>>>>>> resolved by EOW.
> >>>>>>>>
> >>>>>>>> One call-out I would make is that we currently don't support Turing
> >>>>>>>> architecture (sm_75). I've been slowly trying to add support, but I
> >>>>>>>> don't think I'd have capacity to do this done by EOW. Does anyone
> >>>>>>>> feel strongly we need this in the 1.4 release? From my perspective
> >>>>>>>> this will already be a strong release without it.
> >>>>>>>>
> >>>>>>>> On Tue, Nov 20, 2018 at 6:42 PM Steffen Rochel <steffenrochel@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks Patrick, lets target to get the PR's merged this week.
> >>>>>>>>>
> >>>>>>>>> Call for contributions from the community: Right now we have 10 PR
> >>>>>>>>> awaiting merge
> >>>>>>>>> <https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge+>
> >>>>>>>>> and we have 61 open PR awaiting review.
> >>>>>>>>> <https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-review>
> >>>>>>>>> I would appreciate if you all can help to review the open PR and the
> >>>>>>>>> committers can drive the merge before code freeze for 1.4.0.
> >>>>>>>>>
> >>>>>>>>> The contributors on the Java API are making progress, but not all
> >>>>>>>>> performance issues are resolved. With some luck it should be
> >>>>>>>>> possible to code freeze towards end of this week.
> >>>>>>>>>
> >>>>>>>>> Are there other critical features/bugs/PR you think need to be
> >>>>>>>>> included in 1.4.0? If so, please communicate as soon as possible.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Steffen
> >>>>>>>>>
> >>>>>>>>> On Mon, Nov 19, 2018 at 8:26 PM Zhao, Patric <patric.zhao@intel.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks, Steffen. I think there is NO open issue to block the
> >>>>>>>>>> MKLDNN to GA now.
> >>>>>>>>>>
> >>>>>>>>>> BTW, several quantization related PRs (#13297,#13260) are under
> >>>>>>>>>> the review and I think it can be merged in this week.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>>
> >>>>>>>>>> --Patric
> >>>>>>>>>>
> >>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>> From: Steffen Rochel [mailto:steffenrochel@gmail.com]
> >>>>>>>>>>> Sent: Tuesday, November 20, 2018 2:57 AM
> >>>>>>>>>>> To: dev@mxnet.incubator.apache.org
> >>>>>>>>>>> Subject: Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release
> >>>>>>>>>>>
> >>>>>>>>>>> On Friday the contributors working on Java API discovered a
> >>>>>>>>>>> potential performance problem with inference using Java API vs.
> >>>>>>>>>>> Python. Investigation is ongoing.
> >>>>>>>>>>> As the Java API is one of the main features for the upcoming
> >>>>>>>>>>> release, I suggest to post-pone the code freeze towards end of
> >>>>>>>>>>> this week.
> >>>>>>>>>>>
> >>>>>>>>>>> Please provide feedback and concern about the change in dates for
> >>>>>>>>>>> code freeze and 1.4.0 release. I will provide updates on progress
> >>>>>>>>>>> resolving the potential performance problem.
> >>>>>>>>>>>
> >>>>>>>>>>> Patrick - do you think it is possible to resolve the remaining
> >>>>>>>>>>> issues on MKL-DNN this week, so we can consider GA for MKL-DNN
> >>>>>>>>>>> with 1.4.0?
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Steffen
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Nov 15, 2018 at 5:26 AM Anton Chernov <mechernov@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> I'd like to remind everyone that 'code freeze' would mean
> >>>>>>>>>>>> cutting a v1.4.x release branch and all following fixes would
> >>>>>>>>>>>> need to be backported.
> >>>>>>>>>>>> Development on master can be continued as usual.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best
> >>>>>>>>>>>> Anton
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Nov 14, 2018 at 6:04, Steffen Rochel <steffenrochel@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Dear MXNet community,
> >>>>>>>>>>>>> the agreed plan was to establish code freeze for 1.4.0 release
> >>>>>>>>>>>>> today. As the 1.3.1 patch release is still ongoing I suggest to
> >>>>>>>>>>>>> post-pone the code freeze to Friday 16th November 2018.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Sergey Kolychev has agreed to act as co-release manager for all
> >>>>>>>>>>>>> tasks which require committer privileges.
> >>>>>>>>>>>>> If anybody is interested to volunteer as release manager - now
> >>>>>>>>>>>>> is the time to speak up. Otherwise I will manage the release.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Steffen
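
Editor's illustration of the timeout-retry idea proposed at the top of this message (wrap 'docker pull' in a timeout, retry with backoff, and fall back to regenerating the cache if the retries are exhausted). This is only a minimal sketch, not the code in ci/docker_cache.py: the function names, the default timeout, the attempt count, and the delays are all assumptions chosen for illustration, and the rebuild step is a caller-supplied placeholder.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_timeout_and_retry(
    cmd: Sequence[str],
    timeout_s: float = 600.0,     # assumed value, not taken from the CI config
    max_attempts: int = 4,        # assumed value
    base_delay_s: float = 5.0,    # assumed value
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run cmd, aborting it after timeout_s, and retry with exponential backoff.

    Returns True as soon as one attempt succeeds and False if all attempts
    time out or exit with a non-zero status.
    """
    for attempt in range(max_attempts):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the timeout elapses; check=True raises on non-zero exit.
            runner(list(cmd), check=True, timeout=timeout_s)
            return True
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            if attempt + 1 < max_attempts:
                # Wait 1x, 2x, 4x, ... the base delay between attempts.
                sleep(base_delay_s * 2 ** attempt)
    return False


def pull_or_rebuild(tag: str, rebuild: Callable[[str], None], **retry_kwargs) -> str:
    """Hypothetical wrapper: try the registry first, regenerate locally otherwise."""
    if run_with_timeout_and_retry(["docker", "pull", tag], **retry_kwargs):
        return "pulled"
    rebuild(tag)  # placeholder for the local cache regeneration step
    return "rebuilt"
```

The runner and sleep parameters exist only so the retry logic can be exercised without Docker or real waiting; in a CI script they would be left at their defaults.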