mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@googlemail.com.INVALID>
Subject Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release
Date Fri, 30 Nov 2018 01:17:51 GMT
Hi Naveen,

Yeah, sorry, that's DockerHub acting up again (unfortunately this happens
every now and then). Basically, docker pull starts multiple download threads,
and it seems that sometimes a single web server request sits in the queue
forever, which then slows down the docker pull (for the cache retrieval).

Chance will be assisting with CI issues this week, and I explained my
proposed solution to him: basically, wrap the 'docker pull' in a timeout
combined with a retry with backoff. Anton proposed that, if the retry still
fails after a few attempts, we fall back to the local cache and cache
regeneration to avoid failing the job. That would solve the problem you're
encountering. We would basically wrap [1] in the timeout-retry mechanism.
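
A rough sketch of what such a wrapper could look like, purely for
illustration. The function names, default timeout, attempt count and delays
below are assumptions, not the code in docker_cache.py; the only real APIs
used are Python's subprocess.run (with its timeout and check arguments) and
the docker pull command itself.

```python
import subprocess
import time


def run_with_retry(cmd, timeout_s=600, attempts=4, base_delay_s=5.0,
                   sleep=time.sleep):
    """Run cmd with a per-attempt timeout and exponential backoff.

    Hypothetical helper: returns True on success and False once all
    attempts have failed or timed out.
    """
    for attempt in range(attempts):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if it runs longer than timeout_s seconds.
            subprocess.run(cmd, check=True, timeout=timeout_s)
            return True
        except (subprocess.TimeoutExpired,
                subprocess.CalledProcessError,
                OSError):
            # Wait base_delay_s, then 2x, 4x, ... before the next attempt;
            # no wait after the final one.
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    return False


def load_cache(image_tag, pull=run_with_retry):
    """Hypothetical caller: pull the cached image or signal a fallback."""
    if pull(["docker", "pull", image_tag]):
        return "pulled"
    # Assumed fallback: the caller rebuilds from the local cache
    # instead of failing the job.
    return "fallback-local-build"
```

With these assumed defaults a stuck pull is killed after ten minutes and
retried after 5, 10 and 20 seconds before the job falls back to a local
rebuild.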

Best regards,
Marco

[1]:
https://github.com/apache/incubator-mxnet/blob/master/ci/docker_cache.py#L107

On Fri, Nov 30, 2018 at 2:01 AM Joshua Z. Zhang <cheungchih@gmail.com>
wrote:

> Hi, I would like to bring a critical performance and stability patch for
> the existing Gluon DataLoader into 1.4.0:
> https://github.com/apache/incubator-mxnet/pull/13447
>
> This PR is finished, waiting for CI to pass.
>
> Steffen, could you help me add that to the tracked list?
>
> Best,
> Zhi
>
> > On Nov 29, 2018, at 4:25 PM, Naveen Swamy <mnnaveen@gmail.com> wrote:
> >
> > The tests are randomly failing in different stages:
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13105/
> > This PR has failed 8 times so far.
> >
> > On Thu, Nov 29, 2018 at 3:43 PM Steffen Rochel <steffenrochel@gmail.com>
> > wrote:
> >
> >> Pedro - ok. Please add the PR to the v1.4.x branch after it is merged to
> >> master, and please update the tracking page:
> >> <https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack>
> >> Steffen
> >>
> >> On Thu, Nov 29, 2018 at 3:00 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> >> wrote:
> >>
> >>> The PR is ready from my side and passes the tests. Unless somebody raises
> >>> any concerns, it's good to go.
> >>>
> >>> On Thu, Nov 29, 2018 at 9:50 PM Steffen Rochel <steffenrochel@gmail.com>
> >>> wrote:
> >>>>
> >>>> Pedro - added to the 1.4.0 tracking list:
> >>>> <https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack>
> >>>>
> >>>> Do you already have an ETA?
> >>>> Steffen
> >>>>
> >>>> On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <pedro.larroy.lists@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi all.
> >>>>>
> >>>>> There are two important issues / fixes on my radar that should go into
> >>>>> the next release:
> >>>>>
> >>>>> 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> >>>>> There is a bug in shape inference on CPU when not using MKL. Also, we
> >>>>> are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> >>>>> I'm finishing a fix for these issues in the above PR.
> >>>>>
> >>>>> 2) https://github.com/apache/incubator-mxnet/issues/13438
> >>>>> We are seeing crashes due to unsafe setenv in multithreaded code.
> >>>>> Calling setenv / getenv from multiple threads is not safe and is causing
> >>>>> segfaults. This piece of code (the handlers in pthread_atfork) already
> >>>>> caused a very difficult-to-diagnose hang in a previous release, where
> >>>>> a fork inside cudnn would deadlock the engine.
> >>>>>
> >>>>> I would remove setenv from 2) as a mitigation, but we would need to
> >>>>> check for regressions, as we could be creating additional threads
> >>>>> inside the engine.
> >>>>>
> >>>>> I would suggest that we address these two major issues before the next
> >>>>> release.
> >>>>>
> >>>>> Pedro
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <steffenrochel@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Dear MXNet community,
> >>>>>>
> >>>>>> I will be the release manager for the upcoming Apache MXNet 1.4.0
> >>>>>> release. Sergey Kolychev will be co-managing the release and providing
> >>>>>> help from the committers' side.
> >>>>>> A release candidate will be cut on November 29, 2018, and voting will
> >>>>>> start on December 7, 2018. Release notes have been drafted here [1]. If
> >>>>>> you have any additional features in progress and would like to include
> >>>>>> them in this release, please ensure they have been merged by November 27,
> >>>>>> 2018. The release schedule is available here [2].
> >>>>>>
> >>>>>> Feel free to add any other comments/suggestions. Please help to review
> >>>>>> and merge outstanding PRs and resolve issues impacting the quality of
> >>>>>> the 1.4.0 release.
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Steffen
> >>>>>>
> >>>>>> [1]
> >>>>>> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> >>>>>>
> >>>>>> [2]
> >>>>>> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> >>>>>> kellen.sunderland@gmail.com> wrote:
> >>>>>>
> >>>>>>> Spoke too soon [1]; it looks like others have been adding Turing support
> >>>>>>> as well (thanks to those helping with this). I believe there are still a
> >>>>>>> few changes we'd have to make to claim support, though (mshadow CMake
> >>>>>>> changes, PyPI package creation tweaks).
> >>>>>>>
> >>>>>>> 1:
> >>>>>>> https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> >>>>>>>
> >>>>>>> On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> >>>>>>> kellen.sunderland@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> >>>>>>>> https://github.com/apache/incubator-mxnet/pull/13310
> >>>>>>>> It fixes a regression in master which causes incorrect feature vectors
> >>>>>>>> to be output when using the TensorRT feature. (Thanks to Nathalie for
> >>>>>>>> helping me track down the root cause of the issue.) I'm currently
> >>>>>>>> blocked on a CI issue I haven't seen before, but hope to have it
> >>>>>>>> resolved by EOW.
> >>>>>>>>
> >>>>>>>> One call-out I would make is that we currently don't support the Turing
> >>>>>>>> architecture (sm_75). I've been slowly trying to add support, but I don't
> >>>>>>>> think I'd have capacity to get this done by EOW. Does anyone feel
> >>>>>>>> strongly that we need this in the 1.4 release? From my perspective this
> >>>>>>>> will already be a strong release without it.
> >>>>>>>>
> >>>>>>>> On Tue, Nov 20, 2018 at 6:42 PM Steffen Rochel <steffenrochel@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks Patric, let's aim to get the PRs merged this week.
> >>>>>>>>>
> >>>>>>>>> Call for contributions from the community: right now we have 10 PRs
> >>>>>>>>> awaiting merge
> >>>>>>>>> <https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge+>
> >>>>>>>>> and 61 open PRs awaiting review
> >>>>>>>>> <https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-review>.
> >>>>>>>>> I would appreciate it if you all could help review the open PRs, and if
> >>>>>>>>> the committers could drive the merges before the code freeze for 1.4.0.
> >>>>>>>>>
> >>>>>>>>> The contributors on the Java API are making progress, but not all
> >>>>>>>>> performance issues are resolved. With some luck it should be possible to
> >>>>>>>>> code freeze towards the end of this week.
> >>>>>>>>>
> >>>>>>>>> Are there other critical features/bugs/PRs you think need to be included
> >>>>>>>>> in 1.4.0? If so, please communicate as soon as possible.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Steffen
> >>>>>>>>>
> >>>>>>>>> On Mon, Nov 19, 2018 at 8:26 PM Zhao, Patric <patric.zhao@intel.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks, Steffen. I think there is NO open issue blocking MKLDNN from
> >>>>>>>>>> going GA now.
> >>>>>>>>>>
> >>>>>>>>>> BTW, several quantization-related PRs (#13297, #13260) are under
> >>>>>>>>>> review, and I think they can be merged this week.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>>
> >>>>>>>>>> --Patric
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>> From: Steffen Rochel [mailto:steffenrochel@gmail.com]
> >>>>>>>>>>> Sent: Tuesday, November 20, 2018 2:57 AM
> >>>>>>>>>>> To: dev@mxnet.incubator.apache.org
> >>>>>>>>>>> Subject: Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0
> >>>>>>>>>>> release
> >>>>>>>>>>>
> >>>>>>>>>>> On Friday the contributors working on the Java API discovered a
> >>>>>>>>>>> potential performance problem with inference using the Java API vs.
> >>>>>>>>>>> Python. Investigation is ongoing.
> >>>>>>>>>>> As the Java API is one of the main features for the upcoming release,
> >>>>>>>>>>> I suggest postponing the code freeze towards the end of this week.
> >>>>>>>>>>>
> >>>>>>>>>>> Please provide feedback and concerns about the change in dates for
> >>>>>>>>>>> the code freeze and the 1.4.0 release. I will provide updates on
> >>>>>>>>>>> progress in resolving the potential performance problem.
> >>>>>>>>>>>
> >>>>>>>>>>> Patric - do you think it is possible to resolve the remaining issues
> >>>>>>>>>>> on MKL-DNN this week, so we can consider GA for MKL-DNN with 1.4.0?
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Steffen
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Nov 15, 2018 at 5:26 AM Anton Chernov <mechernov@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> I'd like to remind everyone that 'code freeze' would mean cutting a
> >>>>>>>>>>>> v1.4.x release branch, and all following fixes would need to be
> >>>>>>>>>>>> backported. Development on master can continue as usual.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best
> >>>>>>>>>>>> Anton
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Nov 14, 2018 at 6:04, Steffen Rochel <steffenrochel@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Dear MXNet community,
> >>>>>>>>>>>>> the agreed plan was to establish the code freeze for the 1.4.0 release
> >>>>>>>>>>>>> today. As the 1.3.1 patch release is still ongoing, I suggest
> >>>>>>>>>>>>> postponing the code freeze to Friday, 16th November 2018.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Sergey Kolychev has agreed to act as co-release manager for all tasks
> >>>>>>>>>>>>> which require committer privileges. If anybody is interested in
> >>>>>>>>>>>>> volunteering as release manager, now is the time to speak up.
> >>>>>>>>>>>>> Otherwise I will manage the release.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Steffen
> >>>>>>>>>>>>>
