mxnet-dev mailing list archives

From "Zhao, Patric" <patric.z...@intel.com>
Subject RE: Proposal to make MKLDNN as default CPU backend
Date Tue, 19 Nov 2019 05:37:51 GMT
It may be a concern, but a little noise can't affect the final results if the
algorithm is numerically stable.
The MKLDNN backend in mxnet-mkl has been used for 2 years and we didn't see
any convergence issues caused by multi-threading.
In other words, the GPU programming model works well for training even though
the same non-determinism from multiple threads exists there.
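
To illustrate the scale of that noise: floating-point addition is not
associative, so splitting a reduction across a different number of threads
changes the accumulation order and can change the last bits of the result.
A minimal sketch (plain NumPy standing in for an OMP-partitioned sum):

    import numpy as np

    vals = np.random.RandomState(0).uniform(size=100000).astype(np.float32)

    # one accumulation order (a single "thread")
    seq = vals.sum()

    # another order: per-chunk partial sums, as a parallel reduction
    # would produce (8 chunks standing in for 8 threads)
    par = sum(chunk.sum() for chunk in np.array_split(vals, 8))

    # the two results typically agree to ~7 significant digits but may
    # differ in the last bits -- harmless for a numerically stable algorithm
    print(float(seq), float(par), abs(float(seq) - float(par)))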

Some training accuracy results were posted in the first PR when MKLDNN was integrated:
https://github.com/apache/incubator-mxnet/pull/8302#issuecomment-359674818

In conclusion, such an issue may happen, but with very low probability. I
believe we can find a solution if it ever happens.

Thanks,

--Patric


> -----Original Message-----
> From: Chris Olivier <cjolivier01@gmail.com>
> Sent: Tuesday, November 19, 2019 11:51 AM
> To: dev@mxnet.incubator.apache.org
> Cc: Tao Lv <mutouorz@gmail.com>
> Subject: Re: Proposal to make MKLDNN as default CPU backend
> 
> (for non mkl dropout, for instance)
> 
> On Mon, Nov 18, 2019 at 7:50 PM Chris Olivier <cjolivier01@gmail.com>
> wrote:
> 
> > To address the deterministic item, I know for a fact that training
> > will not be deterministic in some cases where the “parallel random”
> > class is utilized in parallel threads, such as with OMP: if the number
> > of cores is different, even with the same seed, threads are seeded
> > independently, so a different number of threads will end up generating
> > different random number sequences. The Dropout operator is an example.
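> >
> > A minimal sketch of the effect (a hypothetical per-thread seeding scheme
> > in plain NumPy, not MXNet's actual parallel RNG):
> >
> >     import numpy as np
> >
> >     def dropout_mask(n, seed, n_threads):
> >         # each "thread" gets its own stream derived from (seed, thread id),
> >         # and the work is split by thread count, so the same base seed
> >         # yields a different overall mask when the thread count changes
> >         chunks = np.array_split(np.arange(n), n_threads)
> >         mask = np.empty(n, dtype=bool)
> >         for tid, chunk in enumerate(chunks):
> >             rs = np.random.RandomState(seed + tid)
> >             mask[chunk] = rs.uniform(size=len(chunk)) > 0.5
> >         return mask
> >
> >     # same seed, different core counts -> almost certainly different masks
> >     print(np.array_equal(dropout_mask(16, 42, 2), dropout_mask(16, 42, 4)))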
> >
> > On Mon, Nov 18, 2019 at 6:39 PM Alfredo Luque
> > <alfredo.luque@airbnb.com.invalid> wrote:
> >
> >> For AMD CPUs, you’d want to perform validation because now MKL-DNN
> >> would be enabled by default. Historically, other intel libraries
> >> (along with the ICC
> >> compiler) have had performance issues on AMD CPUs. It’s just worth
> >> double checking to make sure that’s not the case here. Perhaps some
> >> MKL-DNN authors can chime in though. It’s not sufficient to double
> >> check that an
> >> AVX2 package passes tests.
> >>
> >> Agreed in the case we’re not releasing ARM binaries.
> >>
> >> The reproducibility argument is about the results being numerically
> >> reproducible. That is, e.g., if I train a model with some fixed set of
> >> data, some random seed, etc., and then run inference on it, do I get
> >> the exact same floating point values for the weights and results?
> >> Does MxNet already offer this without MKL-DNN?
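> >>
> >> (A minimal sketch of the check I mean, assuming the Gluon API: train the
> >> same tiny model twice from the same seed and compare weights bit-for-bit.)
> >>
> >>     import mxnet as mx
> >>     import numpy as np
> >>
> >>     def train_once(seed):
> >>         mx.random.seed(seed)
> >>         np.random.seed(seed)
> >>         net = mx.gluon.nn.Dense(1)
> >>         net.initialize(mx.init.Xavier())
> >>         trainer = mx.gluon.Trainer(net.collect_params(), 'sgd',
> >>                                    {'learning_rate': 0.1})
> >>         x = mx.nd.random.uniform(shape=(32, 8))
> >>         y = mx.nd.random.uniform(shape=(32, 1))
> >>         loss_fn = mx.gluon.loss.L2Loss()
> >>         with mx.autograd.record():
> >>             loss = loss_fn(net(x), y)
> >>         loss.backward()
> >>         trainer.step(32)
> >>         # return the trained weights for comparison
> >>         return [p.data().asnumpy() for p in net.collect_params().values()]
> >>
> >>     a, b = train_once(0), train_once(0)
> >>     print(all(np.array_equal(p, q) for p, q in zip(a, b)))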
> >>
> >> On November 18, 2019 at 6:32:07 PM, Tao Lv (mutouorz@gmail.com) wrote:
> >>
> >> Regarding the cases listed by Marco:
> >> - AMD CPU
> >> From my architecture knowledge, what works on C4 instances (with AVX2
> >> support) should also work well on m5a, right? I think the mxnet-mkl and
> >> mxnet-cuXXmkl packages have been fully validated on AVX2 machines.
> >> Also, we didn't perform any validation on AMD CPUs before; why do we
> >> need to do that this time?
> >>
> >> - ARM CPU
> >> I don't think we're releasing any convenience binaries for ARM CPU.
> >> This proposal mainly targets those pypi packages.
> >>
> >> - Windows
> >> Already validated by CI. We're also releasing mxnet-mkl packages for Win.
> >>
> >> - GPU and MKLDNN enabled
> >> Already validated by CI, and mxnet-cuXXmkl packages have been released
> >> for several versions.
> >>
> >> - Fully reproducible results (medical and financial sector requested
> >> that and we have some flags for cuda)
> >> Not sure I understand this case. We have already had the MKL-DNN
> >> backend for a while; its functionality and correctness have been
> >> verified by MXNet users.
> >>
> >> -tao
> >>
> >> On Tue, Nov 19, 2019 at 4:41 AM Marco de Abreu
> >> <marco.g.abreu@gmail.com>
> >> wrote:
> >>
> >> > Sorry, my intent with the "non-standard" phrase was not about general
> >> > MXNet but rather from MKLDNN's point of view: considering that it's
> >> > being developed by Intel, I assumed that MKLDNN might consider
> >> > non-Intel use cases non-standard.
> >> >
> >> > -Marco
> >> >
> >> > Skalicky, Sam <sskalic@amazon.com.invalid> wrote on Mon., Nov. 18,
> >> > 2019, 21:34:
> >> >
> >> > > Thanks Alfredo. If you can create a GitHub issue with notes/steps, we
> >> > > can add this to the todo list for integrating with the MXNet CI to
> >> > > test on m5a instances too. Then we can start tracking this on a
> >> > > regular basis. It would be great to actually test on ARM instances
> >> > > now that AWS has A1 instances too… I'll add it to the wish list ;-D
> >> > >
> >> > > Sam
> >> > >
> >> > > > On Nov 18, 2019, at 12:32 PM, Alfredo Luque
> >> > > > <alfredo.luque@airbnb.com.INVALID> wrote:
> >> > > >
> >> > > > Happy to run some benchmarks on an AWS m5a instance (Epyc) and a
> >> > > > first-generation AMD Threadripper if someone has something easy to
> >> > > > run and representative.
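> >> > > >
> >> > > > (If nothing ready-made turns up, a minimal sketch of the kind of
> >> > > > thing I could run; a hypothetical conv micro-benchmark, not an
> >> > > > agreed benchmark suite:)
> >> > > >
> >> > > >     import time
> >> > > >     import mxnet as mx
> >> > > >
> >> > > >     x = mx.nd.random.uniform(shape=(32, 3, 224, 224))
> >> > > >     w = mx.nd.random.uniform(shape=(64, 3, 3, 3))
> >> > > >
> >> > > >     # warm up once so lazy initialization doesn't skew the timing
> >> > > >     mx.nd.Convolution(data=x, weight=w, kernel=(3, 3),
> >> > > >                       num_filter=64, no_bias=True)
> >> > > >     mx.nd.waitall()
> >> > > >
> >> > > >     start = time.time()
> >> > > >     for _ in range(100):
> >> > > >         mx.nd.Convolution(data=x, weight=w, kernel=(3, 3),
> >> > > >                           num_filter=64, no_bias=True)
> >> > > >     mx.nd.waitall()  # block until all async ops have finished
> >> > > >     print((time.time() - start) / 100, 'sec per conv')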
> >> > > >
> >> > > > On November 18, 2019 at 12:29:31 PM, Skalicky, Sam (
> >> > > > sskalic@amazon.com.invalid) wrote:
> >> > > >
> >> > > > Thanks, good idea Alfredo. Are you able to help test on AMD CPUs?
> >> > > > Or is there someone else in the mxnet dev@ community who can help?
> >> > > >
> >> > > > Sam
> >> > > >
> >> > > >> On Nov 18, 2019, at 12:27 PM, Alfredo Luque
> >> > > > <alfredo.luque@airbnb.com.INVALID> wrote:
> >> > > >>
> >> > > >> Verifying that there isn’t a slowdown on AMD CPUs (e.g., Ryzen /
> >> > > >> Epyc) would definitely make sense as a requirement. It seems odd
> >> > > >> to classify that as a “nonstandard” use case.
> >> > > >>
> >> > > >> On November 18, 2019 at 12:20:33 PM, Skalicky, Sam (
> >> > > >> sskalic@amazon.com.invalid) wrote:
> >> > > >>
> >> > > >> Thanks Patric & team for your work over the years to make MXNet
> >> > > >> fast with MKLDNN!
> >> > > >>
> >> > > >> I think it would be great to make MKLDNN enabled by default. We
> >> > > >> will need to continue producing variants without MKLDNN for those
> >> > > >> who don’t want it (Marco enumerated some use cases). How do you
> >> > > >> propose to identify the pip wheels with/without MKLDNN? Previously
> >> > > >> we had mxnet-mkl and mxnet-cu101mkl with MKLDNN. If the plain
> >> > > >> “mxnet” pip wheel now contains MKLDNN, what do you propose we call
> >> > > >> the build without MKLDNN? mxnet-nomkl?
> >> > > >>
> >> > > >> Thanks!
> >> > > >> Sam
> >> > > >>
> >> > > >>> On Nov 18, 2019, at 11:08 AM, Marco de Abreu
> >> > > >>> <marco.g.abreu@gmail.com> wrote:
> >> > > >>>
> >> > > >>> Hi Patric,
> >> > > >>>
> >> > > >>> First of all, thanks a lot to you and your team for all the
> >> > > >>> effort on MXNet and mkldnn!
> >> > > >>>
> >> > > >>> Generally I'm inclined towards your proposal, but I'm thinking
> >> > > >>> about the non-standard use cases:
> >> > > >>> - AMD CPU
> >> > > >>> - ARM CPU
> >> > > >>> - Windows
> >> > > >>> - GPU and MKLDNN enabled
> >> > > >>> - Fully reproducible results (the medical and financial sectors
> >> > > >>> requested that, and we have some flags for cuda)
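> >> > > >>>
> >> > > >>> (For cuda I presume flags like the MXNET_ENFORCE_DETERMINISM
> >> > > >>> environment variable; a minimal sketch of how it would be set:)
> >> > > >>>
> >> > > >>>     import os
> >> > > >>>
> >> > > >>>     # restrict MXNet/cuDNN to deterministic algorithms; set it
> >> > > >>>     # before importing mxnet so it is picked up at startup
> >> > > >>>     os.environ['MXNET_ENFORCE_DETERMINISM'] = '1'
> >> > > >>>
> >> > > >>>     import mxnet as mx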
> >> > > >>>
> >> > > >>> Is mkldnn fully compatible with these use cases? If not, what
> >> > > >>> would happen? If yes, do we have performance numbers?
> >> > > >>>
> >> > > >>> Best regards,
> >> > > >>> Marco
> >> > > >>>
> >> > > >>> Zhao, Patric <patric.zhao@intel.com> wrote on Mon., Nov. 18,
> >> > > >>> 2019, 14:00:
> >> > > >>>
> >> > > >>>> Hi MXNet community,
> >> > > >>>>
> >> > > >>>> Since the MKLDNN backend was first integrated in release 1.2, the
> >> > > >>>> community has continuously improved the quality and performance
> >> > > >>>> of the MKLDNN CPU backend. Nowadays, the MKLDNN backend is widely
> >> > > >>>> used for inference, especially INT8 inference, and we have
> >> > > >>>> received lots of very positive feedback from MXNet users.
> >> > > >>>>
> >> > > >>>> Achieved milestones so far:
> >> > > >>>>
> >> > > >>>> - MKLDNN integrated into Apache MXNet in release 1.2, Feb 2018 [1]
> >> > > >>>> - MKLDNN as the default CPU backend when building from source,
> >> > > >>>> Jan 2019 [2]
> >> > > >>>> - MKLDNN subgraph optimization on by default for inference,
> >> > > >>>> Jul 2019 [3]
> >> > > >>>> - MKLDNN major version upgrade in release 1.6, Oct 2019 [4]
> >> > > >>>>
> >> > > >>>> To strengthen Apache MXNet's technical leadership in the
> >> > > >>>> industry, I propose to make MKLDNN the default CPU backend in
> >> > > >>>> all binary distributions from the next release.
> >> > > >>>> The new milestones include:
> >> > > >>>>
> >> > > >>>> - Statically link the MKLDNN library into the binary, avoiding a
> >> > > >>>> version mismatch at runtime [5]
> >> > > >>>> - Make MKLDNN the default in nightly builds from master before
> >> > > >>>> the 1.7 release
> >> > > >>>> - Ship binary distributions with MKLDNN as default from the 1.7
> >> > > >>>> release
> >> > > >>>>
> >> > > >>>> What will change:
> >> > > >>>>
> >> > > >>>> - The mxnet and mxnet-cuXX binaries will be built with MKLDNN=1
> >> > > >>>> - mxnet-mkl and mxnet-cuXXmkl will not change in the minor
> >> > > >>>> releases (1.x); we plan to remove them in the next major
> >> > > >>>> release (2.0)
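> >> > > >>>>
> >> > > >>>> For anyone who wants to check whether an installed wheel has
> >> > > >>>> MKLDNN enabled, a small runtime check (assuming MXNet >= 1.5,
> >> > > >>>> where the mxnet.runtime module is available):
> >> > > >>>>
> >> > > >>>>     from mxnet.runtime import Features
> >> > > >>>>
> >> > > >>>>     # lists the compile-time features of the installed binary;
> >> > > >>>>     # 'MKLDNN' will report enabled once this proposal lands
> >> > > >>>>     features = Features()
> >> > > >>>>     print(features.is_enabled('MKLDNN'))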
> >> > > >>>>
> >> > > >>>> Suggestions and comments are highly appreciated.
> >> > > >>>>
> >> > > >>>> Thanks,
> >> > > >>>>
> >> > > >>>> --Patric
> >> > > >>>>
> >> > > >>>>
> >> > > >>>> [1] https://github.com/apache/incubator-mxnet/pull/9677
> >> > > >>>> [2] https://lists.apache.org/thread.html/bfeae6ee46374112eb4dff1470c262959101e4bffb19930926963535@%3Cdev.mxnet.apache.org%3E
> >> > > >>>> [3] https://github.com/apache/incubator-mxnet/pull/15518
> >> > > >>>> [4] https://lists.apache.org/thread.html/f46ab920f18795496eafe713e6e9e561c684e06189085cec17b401dc@%3Cdev.mxnet.apache.org%3E
> >> > > >>>> [5] https://github.com/apache/incubator-mxnet/pull/16731
> >> > > >>>>
> >> > > >>
> >> > > >> —
> >> > > >> Alfredo Luque
> >> > > >> Software Engineer
> >> > > >> Machine Learning Infrastructure Airbnb San Francisco, CA
> >> > > >
> >> > > > —
> >> > > > Alfredo Luque
> >> > > > Software Engineer
> >> > > > Machine Learning Infrastructure Airbnb San Francisco, CA
> >> > >
> >> > >
> >> >
> >>
> >> —
> >> Alfredo Luque
> >> Software Engineer
> >> Machine Learning Infrastructure
> >> Airbnb
> >> San Francisco, CA
> >>
> >