mxnet-dev mailing list archives

From Chris Olivier <cjolivie...@gmail.com>
Subject Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release
Date Thu, 29 Nov 2018 17:15:22 GMT
I don’t think that does anything at all, as stated in my other email.
Someone can look into the omp code to be sure, but my suspicion is that the
environment variable is only read on startup. At any rate, it is better to
set the value through the API at runtime.
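
A minimal sketch of the distinction drawn above, not taken from MXNet or from any OpenMP runtime: `ThreadPoolConfig` and the `DEMO_NUM_THREADS` variable are hypothetical stand-ins for a runtime that reads its thread count from the environment once at startup. Under that assumption, a later change to the environment has no effect, while an explicit runtime call does (in OpenMP the corresponding runtime routine is `omp_set_num_threads`).

```python
import os


class ThreadPoolConfig:
    """Hypothetical runtime that reads its thread count once, at startup."""

    def __init__(self, env_var="DEMO_NUM_THREADS", default=4):
        # The environment is consulted only here; the value is then cached.
        self._num_threads = int(os.environ.get(env_var, default))

    def set_num_threads(self, n):
        # Runtime API: takes effect immediately, without touching the environment.
        if n < 1:
            raise ValueError("thread count must be positive")
        self._num_threads = n

    def get_num_threads(self):
        return self._num_threads


os.environ["DEMO_NUM_THREADS"] = "8"
pool = ThreadPoolConfig()              # startup: caches the value 8

os.environ["DEMO_NUM_THREADS"] = "1"   # later mutation of the environment
after_setenv = pool.get_num_threads()  # still the cached startup value

pool.set_num_threads(1)                # explicit runtime call
after_api = pool.get_num_threads()     # reflects the new setting
```

If the real runtime behaves like this stand-in, the setenv calls under discussion would buy nothing, and the same intent could be expressed through the runtime call without mutating shared process state.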

On Thu, Nov 29, 2018 at 8:11 AM Pedro Larroy <pedro.larroy.lists@gmail.com>
wrote:

> To be precise, what would be the consequences of not having these env
> variables set in the engine threads related to OMP?
> Given your experience with OpenMP I hope you can help us answer these
> questions.
>
> Hopefully we can get the same effect (if any) of these setenvs using
> some openmp call or a pragma. Definitely we shouldn't be mutating the
> environment from a different thread from what I understand, which is
> the likely cause of the random crashes some users are experiencing.
>
> Pedro
> On Thu, Nov 29, 2018 at 5:00 PM Pedro Larroy
> <pedro.larroy.lists@gmail.com> wrote:
> >
> > Chris.  The problem is with setenv, not with getenv. We don't want to
> > remove any getenv call, just these misplaced setenvs:
> >
> > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L61
> >
> > Please check the code above carefully and give us your feedback. Based
> > on your email I think we don't yet have a common understanding of the
> > root cause of this issue.
> >
> > Pedro.
> > On Thu, Nov 29, 2018 at 4:02 PM Chris Olivier <cjolivier01@gmail.com> wrote:
> > >
> > > - getenv should be thread safe as long as nothing is calling
> > > putenv/setenv in another thread (the environment doesn’t change) as
> > > stated here:
> > >
> > > http://www.cplusplus.com/reference/cstdlib/getenv/
> > >
> > > it’s a simple library call, so to be sure either way, one can check the
> > > actual source and see (in case some particular implementation is acting
> > > in a particularly thread-unsafe manner). This should be vetted before
> > > making any high-impact decisions such as trying to go remove every
> > > getenv call in the whole system.
> > >
> > > - locking after fork is possibly due to libgomp not supporting forking
> > > such that after a fork, a call is made to release the blocked omp
> > > threads and the main thread waits for the omp threads to finish, but the
> > > omp threads belong to the pre-forked process and thus never execute,
> > > causing that forked process to freeze.  This behavior has been witnessed
> > > before.
> > >
> > > On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > >
> > > > Hi all.
> > > >
> > > > Two important issues / fixes on my radar should go in the next
> > > > release:
> > > >
> > > > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > > > There is a bug in shape inference on CPU when not using MKL; we are
> > > > also running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > > > I'm finishing a fix for these issues in the above PR.
> > > >
> > > > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > > > We are seeing crashes due to unsafe setenv in multithreaded code.
> > > > Setenv / getenv from multiple threads is not safe and is causing
> > > > segfaults. This piece of code (the handlers in pthread_atfork) already
> > > > caused a very difficult to diagnose hang in a previous release, where
> > > > a fork inside cudnn would deadlock the engine.
> > > >
> > > > I would remove setenv from 2) as a mitigation, but we would need to
> > > > check for regressions as we could be creating additional threads
> > > > inside the engine.
> > > >
> > > > I would suggest that we address these two major issues before the next
> > > > release.
> > > >
> > > > Pedro
> > > >
> > > >
> > > >
> > > > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <steffenrochel@gmail.com> wrote:
> > > > >
> > > > > Dear MXNet community,
> > > > >
> > > > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > > > > release. Sergey Kolychev will be co-managing the release and providing
> > > > > help from the committers side.
> > > > > A release candidate will be cut on November 29, 2018 and voting will
> > > > > start December 7, 2018. Release notes have been drafted here [1]. If
> > > > > you have any additional features in progress and would like to include
> > > > > them in this release, please ensure they have been merged by
> > > > > November 27, 2018. Release schedule is available here [2].
> > > > >
> > > > > Feel free to add any other comments/suggestions. Please help to review
> > > > > and merge outstanding PRs and resolve issues impacting the quality of
> > > > > the 1.4.0 release.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Steffen
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> > > > >
> > > > > [2] https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <kellen.sunderland@gmail.com> wrote:
> > > > > >
> > > > > > Spoke too soon[1], looks like others have been adding Turing support
> > > > > > as well (thanks to those helping with this).  I believe there are
> > > > > > still a few changes we'd have to make to claim support though (mshadow
> > > > > > CMake changes, PyPi package creation tweaks).
> > > > > >
> > > > > > 1: https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> > > > > >
> > > > > > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <kellen.sunderland@gmail.com> wrote:
> > > > > > >
> > > > > > > Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> > > > > > > https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> > > > > > > regression in master which causes incorrect feature vectors to be
> > > > > > > output when using the TensorRT feature.  (Thanks to Nathalie for
> > > > > > > helping me track down the root cause of the issue).   I'm currently
> > > > > > > blocked on a CI issue I haven't seen before, but hope to have it
> > > > > > > resolved by EOW.
> > > > > > >
> > > > > > > One call-out I would make is that we currently don't support Turing
> > > > > > > architecture (sm_75).  I've been slowly trying to add support, but I
> > > > > > > don't think I'd have capacity to get this done by EOW.  Does anyone
> > > > > > > feel strongly we need this in the 1.4 release?  From my perspective
> > > > > > > this will already be a strong release without it.
> > > > > > >
> > > > > > > On Tue, Nov 20, 2018 at 6:42 PM Steffen Rochel <steffenrochel@gmail.com> wrote:
> > > > > > >
> > > > > > >> Thanks Patrick, let's target getting the PRs merged this week.
> > > > > > >>
> > > > > > >> Call for contributions from the community: Right now we have 10 PRs
> > > > > > >> awaiting merge
> > > > > > >> <https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge+>
> > > > > > >> and we have 61 open PRs awaiting review.
> > > > > > >> <https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-review>
> > > > > > >> I would appreciate it if you all can help to review the open PRs and
> > > > > > >> the committers can drive the merge before code freeze for 1.4.0.
> > > > > > >>
> > > > > > >> The contributors on the Java API are making progress, but not all
> > > > > > >> performance issues are resolved. With some luck it should be possible
> > > > > > >> to code freeze towards the end of this week.
> > > > > > >>
> > > > > > >> Are there other critical features/bugs/PRs you think need to be
> > > > > > >> included in 1.4.0? If so, please communicate as soon as possible.
> > > > > > >>
> > > > > > >> Regards,
> > > > > > >> Steffen
> > > > > > >>
> > > > > > >> On Mon, Nov 19, 2018 at 8:26 PM Zhao, Patric <patric.zhao@intel.com> wrote:
> > > > > > >>
> > > > > > >> > Thanks, Steffen. I think there is NO open issue blocking MKLDNN
> > > > > > >> > from going GA now.
> > > > > > >> >
> > > > > > >> > BTW, several quantization-related PRs (#13297, #13260) are under
> > > > > > >> > review and I think they can be merged this week.
> > > > > > >> >
> > > > > > >> > Thanks,
> > > > > > >> >
> > > > > > >> > --Patric
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > > -----Original Message-----
> > > > > > >> > > From: Steffen Rochel [mailto:steffenrochel@gmail.com]
> > > > > > >> > > Sent: Tuesday, November 20, 2018 2:57 AM
> > > > > > >> > > To: dev@mxnet.incubator.apache.org
> > > > > > >> > > Subject: Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release
> > > > > > >> > >
> > > > > > >> > > On Friday the contributors working on the Java API discovered a
> > > > > > >> > > potential performance problem with inference using the Java API
> > > > > > >> > > vs. Python. Investigation is ongoing.
> > > > > > >> > > As the Java API is one of the main features for the upcoming
> > > > > > >> > > release, I suggest postponing the code freeze towards the end of
> > > > > > >> > > this week.
> > > > > > >> > >
> > > > > > >> > > Please provide feedback and concerns about the change in dates
> > > > > > >> > > for code freeze and the 1.4.0 release. I will provide updates on
> > > > > > >> > > progress resolving the potential performance problem.
> > > > > > >> > >
> > > > > > >> > > Patrick - do you think it is possible to resolve the remaining
> > > > > > >> > > issues on MKL-DNN this week, so we can consider GA for MKL-DNN
> > > > > > >> > > with 1.4.0?
> > > > > > >> > >
> > > > > > >> > > Regards,
> > > > > > >> > > Steffen
> > > > > > >> > >
> > > > > > >> > > On Thu, Nov 15, 2018 at 5:26 AM Anton Chernov <mechernov@gmail.com> wrote:
> > > > > > >> > >
> > > > > > >> > > > I'd like to remind everyone that 'code freeze' would mean cutting
> > > > > > >> > > > a v1.4.x release branch and all following fixes would need to be
> > > > > > >> > > > backported.
> > > > > > >> > > > Development on master can be continued as usual.
> > > > > > >> > > >
> > > > > > >> > > > Best
> > > > > > >> > > > Anton
> > > > > > >> > > >
> > > > > > >> > > > On Wed, Nov 14, 2018 at 6:04 Steffen Rochel <steffenrochel@gmail.com> wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Dear MXNet community,
> > > > > > >> > > > > the agreed plan was to establish code freeze for the 1.4.0
> > > > > > >> > > > > release today. As the 1.3.1 patch release is still ongoing I
> > > > > > >> > > > > suggest postponing the code freeze to Friday, 16th November
> > > > > > >> > > > > 2018.
> > > > > > >> > > > >
> > > > > > >> > > > > Sergey Kolychev has agreed to act as co-release manager for all
> > > > > > >> > > > > tasks which require committer privileges. If anybody is
> > > > > > >> > > > > interested in volunteering as release manager - now is the time
> > > > > > >> > > > > to speak up. Otherwise I will manage the release.
> > > > > > >> > > > >
> > > > > > >> > > > > Regards,
> > > > > > >> > > > > Steffen
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > >
>
