systemml-dev mailing list archives

From Nakul Jindal <>
Subject Re: Build and distribution related issues for GPU support
Date Fri, 02 Dec 2016 20:03:02 GMT

Thanks for your questions :)
This will help point us to a public discussion about the decision to put
the ptx under version control.

From what I understand, we compile for a certain virtual architecture and
for a certain real GPU architecture (using the -arch and -code options,
respectively).
Currently, we compile for sm_20.
This ptx is also good for "higher" real architectures (sm_30, sm_32, sm_35,
sm_50, sm_52, sm_53).

Further Reading / References:

So to answer your first question, whether it will run on Kepler devices -
Yes, it will, because Kepler (sm_30 and up) is higher than sm_20.
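
To illustrate the compatibility rule above, here is a minimal Python sketch
(not SystemML code). The helper names compute_capability and ptx_can_jit_on
are made up for this example. It only models the compute-capability ordering
and assumes the installed driver is recent enough to understand the PTX
version used in the file.

```python
import re


def compute_capability(arch):
    """Parse a name such as 'sm_35' or 'compute_35' into (major, minor)."""
    match = re.fullmatch(r"(?:sm|compute)_(\d+)(\d)", arch)
    if match is None:
        raise ValueError("unrecognized architecture name: %r" % arch)
    return int(match.group(1)), int(match.group(2))


def ptx_can_jit_on(ptx_target, device_arch):
    """Return True if PTX built for ptx_target can be JIT-compiled for device_arch.

    PTX is forward compatible: a device can use it when its compute
    capability is equal to or higher than the one the PTX was built for.
    """
    return compute_capability(device_arch) >= compute_capability(ptx_target)


# PTX built for sm_20, checked against a Kepler device (sm_35).
kepler_ok = ptx_can_jit_on("sm_20", "sm_35")
```

The reverse does not hold: PTX built for sm_35 would not be usable on an
sm_20 device, so ptx_can_jit_on("sm_35", "sm_20") returns False.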

For your second question - is there a performance diff between CUBIN and
PTX - yes there is.
CUBINs are compiled for a target architecture, PTX is for the virtual GPU
ISA (forward compatible) which is compiled at runtime by the JIT.
There is a startup cost. This post describes approaches to mitigate that
startup cost:
The blog post suggests either shipping a fat cubin - which has the ptx and
compiled code for more than one target GPU architecture - or using JIT
caching, which is controlled by setting environment variables.
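
A rough sketch of the JIT caching option, in Python. The environment variable
names CUDA_CACHE_DISABLE, CUDA_CACHE_PATH and CUDA_CACHE_MAXSIZE are the ones
the CUDA driver reads. The helper name jit_cache_env, the cache directory and
the 256 MB default size are assumptions chosen only for illustration.

```python
import os


def jit_cache_env(cache_dir, max_bytes=256 * 1024 * 1024, base_env=None):
    """Return a copy of the environment with the CUDA JIT cache enabled.

    cache_dir and max_bytes are illustrative values, not recommendations.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_CACHE_DISABLE"] = "0"             # 0 keeps the JIT cache on
    env["CUDA_CACHE_PATH"] = cache_dir          # where compiled kernels are stored
    env["CUDA_CACHE_MAXSIZE"] = str(max_bytes)  # cache size limit in bytes
    return env


# Hypothetical usage, launching SystemML with the settings applied:
#   import subprocess
#   subprocess.run(["bin/systemml", "file.dml", "-gpu", "force=true"],
#                  env=jit_cache_env("/tmp/cuda-jit-cache"))
```

With the cache enabled, the PTX is compiled on the first run and later runs
on the same machine can reuse the cached binary instead of compiling again.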

Shipping a fat cubin is obviously much more heavyweight than just the ptx.
Realistically, the PTX JIT compilation adds less than 5 seconds of startup
overhead (on the platforms I tested on), if the "-gpu" option is used.
It can be argued that in a long running job, a constant cost is justified.
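
As a back-of-the-envelope check of that argument, the following Python sketch
computes what share of the total run time a constant JIT cost takes up. The
function name startup_overhead_fraction and the one-hour job length are
assumptions for illustration; the 5 seconds is the upper bound quoted above.

```python
def startup_overhead_fraction(jit_seconds, job_seconds):
    """Fraction of total wall-clock time spent on the one-time JIT compile."""
    total = jit_seconds + job_seconds
    if total <= 0:
        raise ValueError("total run time must be positive")
    return jit_seconds / total


# 5 s of JIT compilation in a run that takes one hour in total (3595 s of work).
one_hour_share = startup_overhead_fraction(5, 3595)
```

For a run of one hour in total, the share is 5/3600, roughly 0.14%. For a
job that itself only runs for 5 seconds, the same cost is half of the run.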


On Thu, Nov 24, 2016 at 12:53 AM, Matthias Boehm <> wrote:

> So just to make sure I understand correctly: right now we compiled the few
> example kernels with PTX version 4.3, implying that this is the minimum
> requirement and SystemML's GPU backend will not run, for example, on Kepler
> devices (with PTX version 3), right?
> Also, is there a performance difference (generated code, or just-in-time
> compilation overhead) between CUBIN and PTX files? If so, can we quantify
> this difference to make a decision here? Thanks.
> Regards,
> Matthias
> On 11/24/2016 8:34 AM, Nakul Jindal wrote:
>> @Matthias -
>> PTX (parallel thread execution) objects are intermediate compiled objects.
>> As of the current master, they are maintained under git version control.
>> This decision was agreed upon after discussing the hassle that a developer
>> of systemml without the nvidia cuda compiler might face.
>> It was decided that a person modifying the .cu files will be responsible
>> for regenerating the .ptx file and committing it to version control.
>> So far, between the active developers of systemml, this practice has not
>> disrupted their regular workflow.
>> About PTX version.
>> Newer PTX versions support newer architectures. As and when we upgrade to
>> newer CUDA versions, we shall use the cuda compiler that ships with that
>> version of the toolkit and compile the .cu files in the project and commit
>> the resulting .ptx files.
>> Thoughts, comments?
>> -Nakul
>> On Wed, Nov 23, 2016 at 2:43 PM, Matthias Boehm <>
>> wrote:
>>> Thanks for sharing, Nakul. Could you please also comment on the PTX story
>>> for custom kernels and different PTX versions?
>>> Regards,
>>> Matthias
>>> On 11/23/2016 10:13 PM, Nakul Jindal wrote:
>>>> Hi,
>>>> SystemML has experimental GPU support, which we are working to solidify.
>>>> Currently, GPU is supported in CP (Standalone/Single Node) mode. It uses a
>>>> single GPU (even if the node has more than 1 GPU).
>>>> Communication between the GPU and JVM happens through JCuda (MIT License) -
>>>> a light java wrapper over CUDA that uses JNI. To that end, JCuda needs to
>>>> compile a platform specific shared library which is then used to
>>>> communicate with the locally installed Cuda.
>>>> To help with not having to compile a piece of C/C++ code each time, we use
>>>> a project Mavenized-Jcuda (MIT License). This project internally has a
>>>> repository which contains compiled shared objects (for JCuda) for different
>>>> platforms for different versions of Cuda.
>>>> For developers of SystemML (People who compile SystemML from source) :
>>>> As of today, one can checkout the master branch and follow a series of
>>>> setup steps to get SystemML in GPU mode running.
>>>> These are the steps -
>>>> docs/devdocs/
>>>> 1a)
>>>> Broadly,
>>>> 0. Compile systemml & mavenized jcuda.
>>>> 1. Mavenized JCuda jars are put into the classpath of SystemML.
>>>> 2. The native shared library should be put in the LD_LIBRARY_PATH or
>>>> java.library.path.
>>>> 3. SystemML should be run with the "-gpu" flag. Like so:
>>>> (In the incubator-systemml directory)
>>>> bin/systemml "file.dml" -gpu force=true
>>>> PR 291 tries to change this so that setup becomes simpler (given that
>>>> mavenized-jcuda is available in one of the repositories specified in
>>>> systemml's pom.xml).
>>>> 1b)
>>>> 0. Compile systemml
>>>> 1. Run systemml
>>>> bin/systemml "file.dml" -gpu force=true
>>>> For users of SystemML:
>>>> We haven't yet decided on how to ship SystemML with GPU support. Here are
>>>> the 2 ways we can think of:
>>>> 2a)
>>>> 0. User installs pre-requisites (java, cuda, etc)
>>>> 1. User "installs" Mavenized-JCuda or JCuda (i.e. the package jars are
>>>> made available in the classpath). Also, the relevant shared object library
>>>> files (.so, .dll) are made available to the JVM through the
>>>> LD_LIBRARY_PATH environment variable or through the java.library.path
>>>> setting. (Note this needs to happen if using cuda <8.0)
>>>> 2. Download and run the systemml jar.
>>>> 2b)
>>>> We package JCuda/Mavenized-JCuda with the SystemML distribution. We already
>>>> package ANTLR and Wink with our jar. Our other dependencies are "provided"
>>>> scope and are not pulled in by the maven shade plugin.
>>>> A separate jar will be released for every platform.
>>>> 0. User installs pre-requisites
>>>> 1. Download and run systemml jar
>>>> There is also the matter of running SystemML with GPU in distributed mode.
>>>> In hybrid_spark mode, with option 2a, we'd need to install
>>>> JCuda/Mavenized-JCuda on all the worker nodes.
>>>> With option 2b, we wouldn't need to.
>>>> Berthold, Niketan and I have had a discussion and agree on option 2a, for
>>>> now.
>>>> Are there any thoughts? Inputs?
>>>> -Nakul Jindal
