mxnet-dev mailing list archives

From Naveen Swamy <mnnav...@gmail.com>
Subject Re: CUDA Support [DISCUSS]
Date Sat, 06 Jan 2018 20:29:24 GMT
+1 to that. I think we don't have to run CUDA 8 on every PR.

On Sat, Jan 6, 2018 at 12:26 PM, Marco de Abreu <
marco.g.abreu@googlemail.com> wrote:

> Very good points, Pracheer.
>
> We have been thinking about running nightly integration tests which would
> test the master branch on a wide base of settings (including IoT devices).
> How about switching to CUDA 9 for PR validation, and doing extensive
> checks on CUDA 8/9 and a variety of other environments during the nightly
> runs? PRs would be tested on the latest and most widely used environments.
> I think this would be a viable solution: if issues arise on CUDA 8 but not
> on CUDA 9, that is something we as a community should investigate, rather
> than just the PR creator, as it could have a wide impact and may also
> influence other parts of MXNet.
>
> -Marco
>
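A minimal sketch of the split Marco proposes: PR validation on the latest environment only, with the wider matrix reserved for nightly runs. The version pairs and trigger names below are illustrative, not MXNet's actual CI configuration.

```python
# Illustrative sketch of the proposed split: PR validation runs only on the
# latest, most widely used environment, while the nightly run covers the
# full matrix. Version pairs and trigger names are hypothetical, not
# MXNet's actual CI configuration.

PR_MATRIX = [("cuda9", "cudnn7")]

NIGHTLY_MATRIX = [
    ("cuda8", "cudnn5.1"),
    ("cuda8", "cudnn6"),
    ("cuda8", "cudnn7"),
    ("cuda9", "cudnn7"),
]

def jobs_for(trigger):
    """Return the (cuda, cudnn) combinations to test for a CI trigger."""
    return PR_MATRIX if trigger == "pull_request" else NIGHTLY_MATRIX

print(jobs_for("pull_request"))  # one combination per PR
print(len(jobs_for("nightly")))  # four combinations nightly
```

The point of the split is that PR turnaround time stays bounded by a single build, while coverage of legacy combinations only costs nightly capacity.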
> On Sat, Jan 6, 2018 at 9:18 PM, pracheer gupta <pracheer_gupta@hotmail.com>
> wrote:
>
> > I agree with Naveen that we shouldn’t be forcing production systems to
> > update immediately as soon as a new version comes out. If we want to get
> > MXNet more widely adopted, we should think about the convenience of our
> > users and the pain we might be forcing on them. Having said that, as
> > Bhavin pointed out, given our resources it might be tricky to completely
> > support the n-1 version in earnest. In fact, trying to support more
> > configurations runs the risk of lowering the overall quality of support,
> > even for the things that are important, due to limited resources.
> >
> > I wonder if there is a compromise possible where we don’t create a
> > “panic” for production systems to upgrade. For instance, how about only
> > supporting the permutations of software/hardware launched in the last 2
> > years (or maybe 3)? This would give people enough time to upgrade while
> > reducing the number of configurations we need to support.
> >
> > I also think the discussion in this thread has two sides to it: one
> > specifically about CUDA 8/9 support, and the other a more general
> > question of how many options the MXNet community should support.
> >
> > For CUDA 9, assuming it is truly backward compatible (read: no bugs at
> > all), we might be able to ask everyone to upgrade within some time frame
> > (6 months? 1 year?). Until then we should keep CUDA 8 in the CI system.
> >
> > Another possible solution is to be strategic for the time being and
> > decide what best helps us in the long term at the cost of some
> > short-term pain: officially support only the latest version (for the
> > next few months at least) until we get the CI system to a really good
> > place where it is convenient, easy to use, and easy to extend with more
> > configurations, and then figure out a policy for how many software
> > versions we should support.
> >
> > -Pracheer
> >
> >
> > On Jan 6, 2018, at 11:18 AM, Marco de Abreu <
> > marco.g.abreu@googlemail.com> wrote:
> > >
> > > What do you think about finding out which version of CUDA our users
> > > are actually using, and maybe finding out why they didn't upgrade if
> > > they are still using an old version? Maybe there are proper business
> > > reasons we are not aware of.
> > >
> > > -Marco
> > >
> > > On Jan 6, 2018 8:08 PM, "Naveen Swamy" <mnnaveen@gmail.com> wrote:
> > >
> > >> I will have to disagree with abandoning the N-1 version of dependent
> > >> libraries as a general guideline for the project. There might be
> > >> exceptions to this, which should be discussed, agreed on, and **well
> > >> documented** on the Apache MXNet webpage.
> > >>
> > >> My reasoning is that users who run software in production take time
> > >> to pick up the latest releases. From my experience with critical
> > >> systems, they will carefully test and evaluate new software before
> > >> deploying it. The latest software sometimes has backward-incompatible
> > >> changes that would break their systems. In order to earn the trust of
> > >> users, it is important that we don't start deprecating software as
> > >> soon as new libraries come out.
> > >>
> > >> What we could do is announce that starting with version MXNet 1.0...
> > >> + N we would only support the N+1 library, with good reasoning like
> > >> this one (CUDA 9 being backward compatible), and recommend that users
> > >> upgrade as well. Ideally this would happen when we release a new
> > >> version of MXNet.
> > >>
> > >> So I think we should support CUDA 8 at least until we release a new
> > >> version of MXNet, and pre-announce if we plan to drop it.
> > >>
> > >> my 2 cents.
> > >>
> > >> Thanks, Naveen
> > >>
> > >>
> > >>
> > >> On Sat, Jan 6, 2018 at 9:48 AM, Bhavin Thaker <bhavinthaker@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi Marco,
> > >>>
> > >>> Here are the Years in which the GPU architectures were introduced:
> > >>>
> > >>>   - Tesla: 2008;
> > >>>   - Fermi: 2010;
> > >>>   - Kepler: 2012;
> > >>>   - Maxwell: 2014;
> > >>>   - Pascal: 2016;
> > >>>   - Volta: 2017;
> > >>>
> > >>> I see no need to support the 7+ year old Fermi architecture for
> > >>> fast-moving Apache MXNet.
> > >>>
> > >>> Bhavin Thaker.
> > >>>
> > >>> On Sat, Jan 6, 2018 at 9:36 AM Marco de Abreu <
> > >>> marco.g.abreu@googlemail.com>
> > >>> wrote:
> > >>>
> > >>>> Just to provide some data. Dropping CUDA 8 support would deprecate
> > >>>> the Fermi architecture, effectively affecting the following devices:
> > >>>>
> > >>>> Compute capability 2.0 (Fermi
> > >>>> <https://en.wikipedia.org/wiki/Fermi_(microarchitecture)>, GF100,
> > >>>> GF110): GeForce GTX 590, GeForce GTX 580, GeForce GTX 570, GeForce
> > >>>> GTX 480, GeForce GTX 470, GeForce GTX 465, GeForce GTX 480M, Quadro
> > >>>> 6000, Quadro 5000, Quadro 4000, Quadro 4000 for Mac, Quadro Plex
> > >>>> 7000, Quadro 5010M, Quadro 5000M, Tesla C2075, Tesla C2050/C2070,
> > >>>> Tesla M2050/M2070/M2075/M2090
> > >>>>
> > >>>> Compute capability 2.1 (GF104, GF106, GF108, GF114, GF116, GF117,
> > >>>> GF119): GeForce GTX 560 Ti, GeForce GTX 550 Ti, GeForce GTX 460,
> > >>>> GeForce GTS 450, GeForce GTS 450*, GeForce GT 640 (GDDR3), GeForce
> > >>>> GT 630, GeForce GT 620, GeForce GT 610, GeForce GT 520, GeForce GT
> > >>>> 440, GeForce GT 440*, GeForce GT 430, GeForce GT 430*, GeForce GT
> > >>>> 420*, GeForce GTX 675M, GeForce GTX 670M, GeForce GT 635M, GeForce
> > >>>> GT 630M, GeForce GT 625M, GeForce GT 720M, GeForce GT 620M, GeForce
> > >>>> 710M, GeForce 610M, GeForce 820M, GeForce GTX 580M, GeForce GTX
> > >>>> 570M, GeForce GTX 560M, GeForce GT 555M, GeForce GT 550M, GeForce
> > >>>> GT 540M, GeForce GT 525M, GeForce GT 520MX, GeForce GT 520M,
> > >>>> GeForce GTX 485M, GeForce GTX 470M, GeForce GTX 460M, GeForce GT
> > >>>> 445M, GeForce GT 435M, GeForce GT 420M, GeForce GT 415M, GeForce
> > >>>> 710M, GeForce 410M, Quadro 2000, Quadro 2000D, Quadro 600, Quadro
> > >>>> 4000M, Quadro 3000M, Quadro 2000M, Quadro 1000M, NVS 310, NVS 315,
> > >>>> NVS 5400M, NVS 5200M, NVS 4200M
> > >>>>
> > >>>> -Marco
> > >>>>
> > >>>> On Sat, Jan 6, 2018 at 6:31 PM, kellen sunderland <
> > >>>> kellen.sunderland@gmail.com> wrote:
> > >>>>
> > >>>>> I like that proposal, Bhavin. I'm also interested to see what the
> > >>>>> other community members think.
> > >>>>>
> > >>>>> On Sat, Jan 6, 2018 at 6:27 PM, Bhavin Thaker <bhavinthaker@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi Kellen,
> > >>>>>>
> > >>>>>> Here is my opinion and stand on this:
> > >>>>>>
> > >>>>>> I see no need to test on CUDA 8 in the Apache MXNet CI,
> > >>>>>> especially when CUDA 9 is backward compatible with earlier Nvidia
> > >>>>>> hardware generations. There is a time and resource cost to
> > >>>>>> maintaining the various combinations in the CI, and so I am NOT
> > >>>>>> in favor of running CUDA 8 in CI unless there is a technical
> > >>>>>> reason/requirement for it. This approach helps encourage users to
> > >>>>>> move to the latest CUDA version and thus keeps the open-source
> > >>>>>> community's maintenance cost low for the generic option of CUDA 9.
> > >>>>>>
> > >>>>>> For example: if a user opens a GitHub issue with Apache MXNet and
> > >>>>>> CUDA 8, I would ask the user to test it with CUDA 9. If the
> > >>>>>> problem happens only on CUDA 8, then a volunteer in the community
> > >>>>>> may work on it. If the problem happens on CUDA 9 as well, then,
> > >>>>>> in my humble opinion, the problem must be fixed by the community.
> > >>>>>> In short, I propose that the MXNet CI run tests only with the
> > >>>>>> latest CUDA 9 version and NOT CUDA 8.
> > >>>>>>
> > >>>>>> I am eager to hear alternate viewpoints/corrections from folks
> > >>>>>> other than Kellen and me.
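The triage rule sketched above can be written down directly; the function and return labels below are illustrative, not an actual process definition.

```python
# Sketch of the triage rule proposed above: reproduce a reported GPU issue
# on CUDA 9 first; only failures on the supported (latest) version are a
# community-wide obligation. Names and labels are illustrative.

def triage(repro_on_cuda9, repro_on_cuda8):
    """Decide ownership of a GPU bug report under the proposed policy."""
    if repro_on_cuda9:
        return "must be fixed by the community"
    if repro_on_cuda8:
        return "a volunteer may work on it"
    return "not reproducible on supported versions"

print(triage(True, True))   # must be fixed by the community
print(triage(False, True))  # a volunteer may work on it
```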
> > >>>>>>
> > >>>>>> Bhavin Thaker.
> > >>>>>>
> > >>>>>> On Sat, Jan 6, 2018 at 8:24 AM kellen sunderland <
> > >>>>>> kellen.sunderland@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Thanks for the thoughts Bhavin. Supporting only the latest
> > >>>>>>> release would also be an option, and it would be easier from a
> > >>>>>>> support point of view.
> > >>>>>>>
> > >>>>>>> "2) I think your question probably is what should be tested by
> > >>>>>>> the Apache MXNet CI and NOT what is supported by Apache MXNet,
> > >>>>>>> correct?"
> > >>>>>>>
> > >>>>>>> I view these two things as being closely related, if not
> > >>>>>>> equivalent. If we don't run at least basic tests on old versions
> > >>>>>>> of CUDA, I think there will be issues that slip through. That
> > >>>>>>> being said, we can rely on users to report these issues, and
> > >>>>>>> chances are we'll be able to provide backwards-compatible
> > >>>>>>> patches. At a minimum, I'd recommend we run tests on all
> > >>>>>>> supported CUDA versions before a release.
> > >>>>>>>
> > >>>>>>> -Kellen
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Sat, Jan 6, 2018 at 5:05 PM, Bhavin Thaker <bhavinthaker@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hi Kellen,
> > >>>>>>>>
> > >>>>>>>> 1) Does Apache MXNet (Incubating) have a support matrix? I
> > >>>>>>>> think the answer is no, because I don’t know of anywhere it is
> > >>>>>>>> documented. One of the mentors told me earlier that the
> > >>>>>>>> community uses and modifies the open-source project as per
> > >>>>>>>> their individual requirements or those of the community. As far
> > >>>>>>>> as I know, there is no single entity that is responsible for
> > >>>>>>>> supporting something in MXNet — corrections to my understanding
> > >>>>>>>> are welcome.
> > >>>>>>>>
> > >>>>>>>> 2) I think your question probably is what should be tested by
> > >>>>>>>> the Apache MXNet CI, and NOT what is supported by Apache MXNet,
> > >>>>>>>> correct?
> > >>>>>>>>
> > >>>>>>>> If yes, I propose testing only the latest CUDA 9 and the
> > >>>>>>>> respective latest cuDNN version in the MXNet CI, since CUDA 9
> > >>>>>>>> is backward compatible with earlier Nvidia hardware
> > >>>>>>>> generations.
> > >>>>>>>>
> > >>>>>>>> I would like to hear reasons why this would not work.
> > >>>>>>>>
> > >>>>>>>> I have commented on the github issue as well:
> > >>>>>>>> https://github.com/apache/incubator-mxnet/issues/8805
> > >>>>>>>>
> > >>>>>>>> Bhavin Thaker.
> > >>>>>>>>
> > >>>>>>>> On Sat, Jan 6, 2018 at 3:30 AM kellen sunderland <
> > >>>>>>>> kellen.sunderland@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hello all, I'd like to propose that we nail down exactly which
> > >>>>>>>>> versions of CUDA we're supporting. We can then ensure that
> > >>>>>>>>> we've got good test coverage for those specific versions in
> > >>>>>>>>> CI. At the moment it's ambiguous what our current policy is,
> > >>>>>>>>> i.e. when do we drop support for old versions? As a result we
> > >>>>>>>>> could potentially cut a release promising to support a certain
> > >>>>>>>>> version of CUDA, then retroactively drop support after we find
> > >>>>>>>>> an issue.
> > >>>>>>>>>
> > >>>>>>>>> I'd like to propose that we officially support the N and N-1
> > >>>>>>>>> versions of CUDA, where N is the most recent major version
> > >>>>>>>>> release. In addition, we can do our best to support the
> > >>>>>>>>> libraries that are available for download for those versions.
> > >>>>>>>>> Supporting these CUDA versions would also dictate which
> > >>>>>>>>> hardware we support in terms of compute capability (of course,
> > >>>>>>>>> resource constraints would also play some role in our ability
> > >>>>>>>>> to support some hardware).
> > >>>>>>>>>
> > >>>>>>>>> As an example, this would mean that currently we'd officially
> > >>>>>>>>> support CUDA 9.* and 8. This would imply we support cuDNN 5.1
> > >>>>>>>>> through 7, as those libraries are available for CUDA 8 and 9.
> > >>>>>>>>> It would also mean we support compute capabilities 3.0-7.x
> > >>>>>>>>> (Kepler, Maxwell, Pascal, Volta), taking the more restrictive
> > >>>>>>>>> hardware requirements of CUDA 9 into account.
> > >>>>>>>>>
> > >>>>>>>>> What do you all think? Would this be a reasonable support
> > >>>>>>>>> strategy? Are these the versions you'd like to see covered in
> > >>>>>>>>> CI?
> > >>>>>>>>>
> > >>>>>>>>> -Kellen
> > >>>>>>>>>
> > >>>>>>>>> A relevant issue:
> > >>>>>>>>> https://github.com/apache/incubator-mxnet/issues/8805
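The N/N-1 policy and the compute-capability range it implies can be sketched directly from the example in this proposal. The per-toolkit (min, max) compute capabilities below follow the mail's reasoning (CUDA 9 drops Fermi); treat the exact numbers as illustrative, not an official support matrix.

```python
# Sketch of the proposed N / N-1 CUDA support policy and the hardware
# range it implies. The (min, max) compute capabilities per toolkit are
# illustrative of the example in the mail, not an official matrix.

SUPPORTED_CUDA = [9, 8]  # N and N-1, where N is the latest major release

COMPUTE_CAPABILITY = {
    8: (2.0, 6.2),  # CUDA 8: Fermi through Pascal
    9: (3.0, 7.0),  # CUDA 9: Kepler through Volta
}

def supported_compute_range(versions):
    # Floor: the most restrictive toolkit wins (CUDA 9 drops Fermi/sm_2x).
    lo = max(COMPUTE_CAPABILITY[v][0] for v in versions)
    # Ceiling: the newest toolkit covers the newest hardware (Volta).
    hi = max(COMPUTE_CAPABILITY[v][1] for v in versions)
    return lo, hi

print(supported_compute_range(SUPPORTED_CUDA))  # (3.0, 7.0)
```

This reproduces the 3.0-7.x (Kepler through Volta) range stated in the proposal: the floor comes from CUDA 9's more restrictive requirements, the ceiling from the newest toolkit.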
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
>
