mxnet-dev mailing list archives

From Asmus Hetzel <asmushet...@yahoo.de.INVALID>
Subject Re: CUDA Support [DISCUSS]
Date Tue, 09 Jan 2018 09:54:34 GMT
+1 for testing CUDA8 and CUDA9 (i.e. at least one version backward).
I recently worked on functionality in the linalg namespace and was not aware that I had
introduced CUDA 8-only dependencies that immediately break every CUDA 7.5 build. I only
found out because Jenkins suddenly failed (it was running 7.5). It is usually a pain to dig
up documentation for old CUDA versions, so I guess I'm not the only one who simply relies
on integration testing to figure out what we support and what we don't. For me, adding some
#ifdefs to the code and ensuring that this functionality is not present in MXNet on CUDA 7.5
was a matter of 30 minutes. But if this had slipped through, the pain for all the users
around the globe still running CUDA 7.5, who would suddenly have had a broken MXNet version,
would have been orders of magnitude higher.
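
As an illustration of the #ifdef approach described above, a guard of this kind might look
like the following sketch. CUDA_VERSION is the macro defined by the toolkit's cuda.h
(7.5 is 7050, 8.0 is 8000). Everything prefixed EXAMPLE_ or example_, and the fallback
value used when no toolkit is installed, is hypothetical and not taken from the MXNet sources.

```cpp
// Sketch: compile-time guard for functionality that needs CUDA 8 or newer.
// CUDA_VERSION comes from <cuda.h> and encodes major*1000 + minor*10.
#if defined(__has_include)
#  if __has_include(<cuda.h>)
#    include <cuda.h>
#  endif
#endif

// Assumption for illustration only: when no toolkit header is found,
// pretend this is a CUDA 7.5 build so the sketch still compiles.
#ifndef CUDA_VERSION
#  define CUDA_VERSION 7050
#endif

// Hypothetical feature flag (not an MXNet macro).
#if CUDA_VERSION >= 8000
#  define EXAMPLE_HAS_CUDA8_LINALG 1
#else
#  define EXAMPLE_HAS_CUDA8_LINALG 0
#endif

// Hypothetical query a caller could use before dispatching to the operator.
inline bool example_linalg_op_available() {
#if EXAMPLE_HAS_CUDA8_LINALG
  // A real implementation would call the CUDA 8+ library routine here.
  return true;
#else
  // On older toolkits the functionality is compiled out entirely.
  return false;
#endif
}
```

With a guard like this, a CUDA 7.5 build still compiles and simply does not offer the newer
functionality, instead of failing at build time.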



    On Saturday, January 6, 2018, 22:18:28 CET, Bhavin Thaker <bhavinthaker@gmail.com>
wrote:
 
 Good arguments, Marco, Naveen, Pracheer.

+1 to the suggestion of testing CUDA8 in a few nightly instances and using
CUDA9 for most instances in CI.

There is no support matrix documented for Apache MXNet and so there is no
forceful upgrade. If a user wants to support a version, they can contribute
to the open-source Apache MXNet.

Bhavin Thaker.

On Sat, Jan 6, 2018 at 12:30 PM Naveen Swamy <mnnaveen@gmail.com> wrote:

> +1 to that. I think we don't have to run CUDA 8 on every PR.
>
> On Sat, Jan 6, 2018 at 12:26 PM, Marco de Abreu <
> marco.g.abreu@googlemail.com> wrote:
>
> > Very good points, pracheer.
> >
> > We have been thinking about running nightly integration tests which will
> > test the master branch on a wide base of settings (including IoT devices).
> > How about switching to CUDA 9 for PR validation, and doing extensive checks
> > on CUDA 8/9 and a variety of other environments during the nightly runs?
> > PRs would be tested on the latest and most widely used environments. I
> > think this would be a viable solution: if issues arise in CUDA 8 but not in
> > CUDA 9, this is something we as a community should investigate rather than
> > just the PR creator, as it could have a wide impact and may also influence
> > other parts of MXNet.
> >
> > -Marco
> >
> > On Sat, Jan 6, 2018 at 9:18 PM, pracheer gupta <pracheer_gupta@hotmail.com>
> > wrote:
> >
> > > I agree with Naveen that we shouldn’t be forcing production systems to
> > > update as soon as a new version comes out. If we want MXNet to be more
> > > widely adopted, we should think about the convenience of the customers and
> > > the pain we might be forcing on them. Having said that, as Bhavin pointed
> > > out, given the resources it might be tricky to fully support the n-1
> > > version in earnest. In fact, trying to support more configurations runs
> > > the risk of lowering the overall quality of support, even for the things
> > > that are important, because of limited resources.
> > >
> > > I wonder if a compromise is possible where we don’t create a “panic” for
> > > production systems to upgrade. For instance, how about just supporting all
> > > the permutations of software/hardware launched in the last 2 years (or
> > > maybe 3)? This would give people enough time to upgrade while reducing the
> > > number of configurations we need to support.
> > >
> > > I also think that the discussion in this thread has two sides to it: one
> > > specifically about cuda8/9 support, and the other a general thought
> > > process around how many options the MXNet community should support.
> > >
> > > For cuda9, assuming it is truly backward compatible (read: no bugs at
> > > all), we might be able to force everyone to upgrade after some time
> > > (6 months? 1 year?). Until then we should keep cuda8 in the CI system?
> > >
> > > Another possible solution is to be strategic for the time being and decide
> > > what might help us get better in the long term at the cost of short-term
> > > pain: officially support only the latest version (for the next few months
> > > at least) until we are able to get the CI system to a really good place
> > > where it is convenient, easy to use, and easy to extend with more
> > > configurations, and then figure out the policy for how many software
> > > versions we should support?
> > >
> > > -Pracheer
> > >
> > >
> > > > On Jan 6, 2018, at 11:18 AM, Marco de Abreu <marco.g.abreu@googlemail.com> wrote:
> > > > >
> > > > > What do you think about finding out which version of CUDA our users are
> > > > > actually using, and maybe finding out why they didn't upgrade if they
> > > > > are still using an old version? Maybe there are some proper business
> > > > > reasons we are not aware of.
> > > >
> > > > -Marco
> > > >
> > > > > On 06.01.2018 at 8:08 PM, "Naveen Swamy" <mnnaveen@gmail.com> wrote:
> > > >
> > > > >> I will have to disagree with abandoning the N-1 version of the dependent
> > > > >> libraries as a general guideline for the project. There might be
> > > > >> exceptions to this, which should be discussed and agreed on and **well
> > > > >> documented** on the Apache MXNet webpage.
> > > > >>
> > > > >> My reasoning is that users who are running software in their production
> > > > >> environments take time to pick up the latest software to deploy. From
> > > > >> my experience, for critical systems they will carefully test and
> > > > >> evaluate new software before deploying. The latest software sometimes
> > > > >> has backward-incompatible features that would break their system. In
> > > > >> order to earn trust from users, it's important that we don't start
> > > > >> deprecating software as and when new libraries come out.
> > > > >>
> > > > >> What we could do is announce that starting with version MXNet 1.00... + N
> > > > >> we would only support the N+1 library, with good reasoning such as CUDA 9
> > > > >> being backward compatible, and recommend that users upgrade as well.
> > > > >> Ideally this would happen when we release a new version of MXNet.
> > > > >>
> > > > >> So I think we should support CUDA 8 at least until we release a new
> > > > >> version of MXNet, and pre-announce if we plan to drop it.
> > > >>
> > > >> my 2 cents.
> > > >>
> > > >> Thanks, Naveen
> > > >>
> > > >>
> > > >>
> > > > >> On Sat, Jan 6, 2018 at 9:48 AM, Bhavin Thaker <bhavinthaker@gmail.com>
> > > > >> wrote:
> > > >>
> > > >>> Hi Marco,
> > > >>>
> > > > >>> Here are the years in which the GPU architectures were introduced:
> > > > >>>
> > > > >>>  - Tesla: 2008;
> > > > >>>  - Fermi: 2010;
> > > > >>>  - Kepler: 2012;
> > > > >>>  - Maxwell: 2014;
> > > > >>>  - Pascal: 2016;
> > > > >>>  - Volta: 2017;
> > > > >>>
> > > > >>> I see no need to support the 7+ year old Fermi architecture for
> > > > >>> fast-moving Apache MXNet.
> > > >>>
> > > >>> Bhavin Thaker.
> > > >>>
> > > >>> On Sat, Jan 6, 2018 at 9:36 AM Marco de Abreu <
> > > >>> marco.g.abreu@googlemail.com>
> > > >>> wrote:
> > > >>>
> > > > >>>> Just to provide some data. Dropping CUDA8 support would deprecate the
> > > > >>>> Fermi architecture, effectively affecting the following devices:
> > > > >>>>
> > > > >>>> Compute capability 2.0, Fermi
> > > > >>>> <https://en.wikipedia.org/wiki/Fermi_(microarchitecture)>
> > > > >>>> (GF100, GF110):
> > > > >>>>  - GeForce: GeForce GTX 590, GeForce GTX 580, GeForce GTX 570,
> > > > >>>>    GeForce GTX 480, GeForce GTX 470, GeForce GTX 465, GeForce GTX 480M
> > > > >>>>  - Quadro: Quadro 6000, Quadro 5000, Quadro 4000, Quadro 4000 for Mac,
> > > > >>>>    Quadro Plex 7000, Quadro 5010M, Quadro 5000M
> > > > >>>>  - Tesla: Tesla C2075, Tesla C2050/C2070, Tesla M2050/M2070/M2075/M2090
> > > > >>>>
> > > > >>>> Compute capability 2.1, Fermi
> > > > >>>> (GF104, GF106, GF108, GF114, GF116, GF117, GF119):
> > > > >>>>  - GeForce: GeForce GTX 560 Ti, GeForce GTX 550 Ti, GeForce GTX 460,
> > > > >>>>    GeForce GTS 450, GeForce GTS 450*, GeForce GT 640 (GDDR3),
> > > > >>>>    GeForce GT 630, GeForce GT 620, GeForce GT 610, GeForce GT 520,
> > > > >>>>    GeForce GT 440, GeForce GT 440*, GeForce GT 430, GeForce GT 430*,
> > > > >>>>    GeForce GT 420*, GeForce GTX 675M, GeForce GTX 670M, GeForce GT 635M,
> > > > >>>>    GeForce GT 630M, GeForce GT 625M, GeForce GT 720M, GeForce GT 620M,
> > > > >>>>    GeForce 710M, GeForce 610M, GeForce 820M, GeForce GTX 580M,
> > > > >>>>    GeForce GTX 570M, GeForce GTX 560M, GeForce GT 555M, GeForce GT 550M,
> > > > >>>>    GeForce GT 540M, GeForce GT 525M, GeForce GT 520MX, GeForce GT 520M,
> > > > >>>>    GeForce GTX 485M, GeForce GTX 470M, GeForce GTX 460M, GeForce GT 445M,
> > > > >>>>    GeForce GT 435M, GeForce GT 420M, GeForce GT 415M, GeForce 710M,
> > > > >>>>    GeForce 410M
> > > > >>>>  - Quadro/NVS: Quadro 2000, Quadro 2000D, Quadro 600, Quadro 4000M,
> > > > >>>>    Quadro 3000M, Quadro 2000M, Quadro 1000M, NVS 310, NVS 315,
> > > > >>>>    NVS 5400M, NVS 5200M, NVS 4200M
> > > >>>>
> > > >>>> -Marco
> > > >>>>
> > > >>>> On Sat, Jan 6, 2018 at 6:31 PM, kellen sunderland <
> > > >>>> kellen.sunderland@gmail.com> wrote:
> > > >>>>
> > > > >>>>> I like that proposal Bhavin.  I'm also interested to see what the
> > > > >>>>> other community members think.
> > > >>>>>
> > > > >>>>> On Sat, Jan 6, 2018 at 6:27 PM, Bhavin Thaker <bhavinthaker@gmail.com>
> > > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Hi Kellen,
> > > >>>>>>
> > > >>>>>> Here is my opinion and stand on this:
> > > >>>>>>
> > > > >>>>>> I see no need to test on CUDA8 in Apache MXNet CI, especially when
> > > > >>>>>> CUDA9 is backward compatible with earlier Nvidia hardware generations.
> > > > >>>>>> There is a time and resource cost to maintaining the various
> > > > >>>>>> combinations in the CI, and so I am NOT in favor of running CUDA8 in CI
> > > > >>>>>> unless there is a technical reason/requirement for it. This approach
> > > > >>>>>> helps to encourage users to move to the latest CUDA version and thus
> > > > >>>>>> keeps the open-source community’s maintenance cost low for the generic
> > > > >>>>>> option of CUDA9.
> > > > >>>>>>
> > > > >>>>>> For example: if a user opens a GitHub issue/problem with Apache MXNet
> > > > >>>>>> and CUDA8, I would ask the user to test it with CUDA9. If the problem
> > > > >>>>>> happens only on CUDA8, then a volunteer in the community may work on
> > > > >>>>>> it. If the problem happens on CUDA9 as well, then, in my humble
> > > > >>>>>> opinion, this problem must be fixed by the community. In short, I
> > > > >>>>>> propose that the MXNet CI run tests only with the latest CUDA9 version
> > > > >>>>>> and NOT CUDA8.
> > > > >>>>>>
> > > > >>>>>> I am eager to hear alternate viewpoints/corrections from folks other
> > > > >>>>>> than Kellen and me.
> > > >>>>>>
> > > >>>>>> Bhavin Thaker.
> > > >>>>>>
> > > >>>>>> On Sat, Jan 6, 2018 at 8:24 AM kellen sunderland <
> > > >>>>>> kellen.sunderland@gmail.com> wrote:
> > > >>>>>>
> > > > >>>>>>> Thanks for the thoughts Bhavin. Supporting the latest release would
> > > > >>>>>>> also be an option, and it would be easier from a support point of view.
> > > > >>>>>>>
> > > > >>>>>>> "2) I think your question probably is what should be tested by the
> > > > >>>>>>> Apache MXNet CI and NOT what is supported by Apache MXNet, correct?"
> > > > >>>>>>>
> > > > >>>>>>> I view these two things as being closely related, if not equivalent. If
> > > > >>>>>>> we don't run at least basic tests of old versions of CUDA I think there
> > > > >>>>>>> will be issues that slip through.  That being said, we can rely on
> > > > >>>>>>> users to report these issues, and chances are we'll be able to provide
> > > > >>>>>>> backwards compatible patches.  At a minimum I'd recommend we run tests
> > > > >>>>>>> on all supported CUDA versions before a release.
> > > >>>>>>>
> > > >>>>>>> -Kellen
> > > >>>>>>>
> > > >>>>>>>
> > > > >>>>>>> On Sat, Jan 6, 2018 at 5:05 PM, Bhavin Thaker <bhavinthaker@gmail.com>
> > > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi Kellen,
> > > >>>>>>>>
> > > > >>>>>>>> 1) Does Apache MXNet (Incubating) have a support matrix? I think the
> > > > >>>>>>>> answer is no, because I don’t know where it is documented. One of the
> > > > >>>>>>>> mentors told me earlier that the community uses and modifies the
> > > > >>>>>>>> open-source project as per their individual requirements or those of
> > > > >>>>>>>> the community. As far as I know, there is no single entity that is
> > > > >>>>>>>> responsible for supporting something in MXNet; corrections to my
> > > > >>>>>>>> understanding are welcome.
> > > > >>>>>>>>
> > > > >>>>>>>> 2) I think your question probably is what should be tested by the
> > > > >>>>>>>> Apache MXNet CI and NOT what is supported by Apache MXNet, correct?
> > > > >>>>>>>>
> > > > >>>>>>>> If yes, I propose testing only the latest CUDA9 and the respective
> > > > >>>>>>>> latest cuDNN version in the MXNet CI since CUDA9 is backward
> > > > >>>>>>>> compatible with earlier Nvidia hardware generations.
> > > > >>>>>>>>
> > > > >>>>>>>> I would like to hear reasons why this would not work.
> > > > >>>>>>>>
> > > > >>>>>>>> I have commented on the github issue as well:
> > > > >>>>>>>> https://github.com/apache/incubator-mxnet/issues/8805
> > > >>>>>>>>
> > > >>>>>>>> Bhavin Thaker.
> > > >>>>>>>>
> > > > >>>>>>>> On Sat, Jan 6, 2018 at 3:30 AM kellen sunderland <kellen.sunderland@gmail.com>
> > > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > > >>>>>>>>> Hello all, I'd like to propose that we nail down exactly which
> > > > >>>>>>>>> versions of CUDA we're supporting.  We can then ensure that we've got
> > > > >>>>>>>>> good test coverage for those specific versions in CI.  At the moment
> > > > >>>>>>>>> it's ambiguous what our current policy is, i.e. when do we drop
> > > > >>>>>>>>> support for old versions?  As a result we potentially cut a release
> > > > >>>>>>>>> promising to support a certain version of CUDA, then retroactively
> > > > >>>>>>>>> drop support after we find an issue.
> > > > >>>>>>>>>
> > > > >>>>>>>>> I'd like to propose that we officially support the N and N-1 versions
> > > > >>>>>>>>> of CUDA, where N is the most recent major version release.  In
> > > > >>>>>>>>> addition we can do our best to support libraries that are available
> > > > >>>>>>>>> for download for those versions.  Supporting these CUDA versions would
> > > > >>>>>>>>> also dictate which hardware we support in terms of compute capability
> > > > >>>>>>>>> (of course resource constraints would also play some role in our
> > > > >>>>>>>>> ability to support some hardware).
> > > > >>>>>>>>>
> > > > >>>>>>>>> As an example this would mean that currently we'd officially support
> > > > >>>>>>>>> CUDA 9.* and 8.  This would imply we support CUDNN 5.1 through 7, as
> > > > >>>>>>>>> those libraries are available for CUDA 8 and 9.  It would also mean we
> > > > >>>>>>>>> support compute capability 3.0-7.x (Kepler, Maxwell, Pascal, Volta),
> > > > >>>>>>>>> taking the more restrictive hardware requirements of CUDA 9 into
> > > > >>>>>>>>> account.
> > > > >>>>>>>>>
> > > > >>>>>>>>> What do you all think?  Would this be a reasonable support strategy?
> > > > >>>>>>>>> Are these the versions you'd like to see covered in CI?
> > > > >>>>>>>>>
> > > > >>>>>>>>> -Kellen
> > > > >>>>>>>>>
> > > > >>>>>>>>> A relevant issue: https://github.com/apache/incubator-mxnet/issues/8805
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> >
>  