From: "Niketan Pansare"
To: dev@systemml.incubator.apache.org
Date: Wed, 25 May 2016 08:08:18 -0700
Subject: Re: Discussion on GPU backend

Thanks Berthold and Matthias for your suggestions. It is important to note that whether we go with (A) or (B), the initial PR will be squashed into one commit, and the individual commits by external contributors will be lost in the process. However, since we are planning to go with option (3), the impact won't be too severe.

Matthias: here are my thoughts regarding the unknowns for the GPU backend:

1. Handling of native libraries:
Both JCuda and Nvidia provide shared libraries/DLLs for most OS/platforms, along with installation instructions.

For deployment:
As per the previous email, the native libraries will be treated as an external dependency, just like hadoop/spark.
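To make the failure modes described below concrete, here is a minimal sketch of the kind of guarded probe this external-dependency treatment implies. The AcceleratorProbe helper is hypothetical, not actual SystemML code; only the jcuda.runtime.JCuda class and its cudaSetDevice method come from JCuda itself:

    import java.lang.reflect.Method;

    // Hypothetical sketch: check whether the GPU backend can be enabled.
    public class AcceleratorProbe {
        public static boolean jcudaUsable() {
            try {
                // Throws ClassNotFoundException if JCu*.jar is not on the classpath.
                Class<?> jcuda = Class.forName("jcuda.runtime.JCuda");
                // Invoking a JCuda method triggers the native library load;
                // throws UnsatisfiedLinkError ("Cannot load ...") if the
                // JCu*.dll/.so files, CUDA, or CuDNN are not installed.
                Method setDevice = jcuda.getMethod("cudaSetDevice", int.class);
                setDevice.invoke(null, 0);
                return true;
            } catch (ClassNotFoundException e) {
                return false; // jar missing: the "Class not found" case
            } catch (UnsatisfiedLinkError e) {
                return false; // jar present, native libraries missing: "Cannot load .."
            } catch (ReflectiveOperationException e) {
                return false; // any other reflection failure
            }
        }
    }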
For example, if someone executes "hadoop jar SystemML.jar -f test.dml -exec hybrid_spark", she will get a "Class Not Found" exception. In similar fashion, if the user does not include JCu*.jar or does not provide the native libraries (JCu*.dll/.so, CUDA, or CuDNN) and supplies the "-accelerator" flag, a "Class not found" or a "Cannot load .." exception will be thrown, respectively. If the user does not supply the "-accelerator" flag, SystemML will proceed with normal execution as it does today.

For dev:
We are planning to host the jcu*.jars in a maven repository. Once that's done, the "system" scope in the pom will be replaced by the "provided" scope and the jcu*.jars will be deleted from the PR. As with deployment, it is the responsibility of the developer to install the native libraries if she intends to work on the GPU backend.

For testing:
The user can set the environment variable "CUDA_PATH" and set the TEST_GPU flag to enable GPU tests (please see https://github.com/apache/incubator-systemml/pull/165/files#diff-bcda036e4c3ff62cb2648acbbd19f61aR113). The PR will be accompanied by additional tests, which will be enabled only when TEST_GPU is set. Having the TEST_GPU flag allows users without an Nvidia GPU to run the integration tests. As with deployment, it is the responsibility of the developer to install the native libraries for testing with the TEST_GPU flag.

The first version will not contain custom native kernels.

2. I can add the summary of the performance comparisons in the PR :)

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar

From: Berthold Reinwald/Almaden/IBM@IBMUS
To: dev@systemml.incubator.apache.org
Date: 05/25/2016 06:03 AM
Subject: Re: Discussion on GPU backend

The discussion is less about (1), (2), or (3). As practiced so far, (3) is the way to go.

The question is about (A) or (B). Curious what the Apache-suggested practice is.

Regards,
Berthold Reinwald
IBM Almaden Research Center
office: (408) 927 2208; T/L: 457 2208
e-mail: reinwald@us.ibm.com

From: Matthias Boehm/Almaden/IBM@IBMUS
To: dev@systemml.incubator.apache.org
Date: 05/24/2016 09:10 PM
Subject: Re: Discussion on GPU backend

Generally, I think we should really stick to (3) as done in the past, i.e., bring up major features in the roadmap discussions, create jira epics, and try to break them into rather isolated tasks. This works for almost any major/minor feature. The only exceptions are features where it is initially unknown whether the potential benefits outweigh the increased complexity (or other disadvantages). Here, prototypes are required, but everybody should be free to choose a way of maintaining them. I also don't expect too much collaboration here because of the unknown status. Once the initial unknowns are resolved, we should come back to (3), though.

Regarding the GPU backend, the unknowns to resolve are (1) the handling of native libraries/kernels for deployment/test/dev, and (2) performance comparisons on selected algorithms (prototypes, not fully integrated), data sizes, and platforms. Once we have answers to these questions, we can create all the tasks for optimizer/runtime integration.

Regards,
Matthias
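As a concrete illustration of the TEST_GPU guard described under "For testing" above, a minimal JUnit 4 sketch might look as follows; the test class, the environment-variable check, and the guarded body are hypothetical, not actual SystemML tests:

    import static org.junit.Assume.assumeTrue;
    import org.junit.Test;

    // Hypothetical sketch of an opt-in GPU test.
    public class MatrixMultGPUTest {
        private static boolean gpuTestsEnabled() {
            // GPU tests run only when the developer opts in via TEST_GPU
            // (and has installed the native libraries, e.g., under CUDA_PATH).
            return System.getenv("TEST_GPU") != null;
        }

        @Test
        public void testMatrixMultOnGPU() {
            // Skipped (not failed) on machines without an Nvidia GPU.
            assumeTrue(gpuTestsEnabled());
            // ... run the DML script with -accelerator and compare results ...
        }
    }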
From: Niketan Pansare/Almaden/IBM@IBMUS
To: dev@systemml.incubator.apache.org
Date: 05/24/2016 11:55 AM
Subject: Re: Discussion on GPU backend

Hi all,

Since there is interest in collaborating on the GPU backend, I wanted to know what the preferred way is to go ahead with a new feature (i.e., the GPU backend). This discussion is also generally applicable to other major features (for example: the Flink backend, deep learning support, the OpenCL backend, new data types, new built-in functions, new algorithms, etc.).

The first point of discussion is what would qualify as a "major feature" and how we integrate it into SystemML. Here are three options that could serve as a potential requirement:
1. The feature has to be fully functional and fully optimized. For example, in the case of additional backends, the PR can be merged only if all the instructions (CP or distributed) have been implemented and are at least as optimized as our existing alternate backends. In the case of algorithms or built-in functions, the PR can be merged only if it runs on all the backends for all datasets and is comparable in performance and accuracy with external ML libraries.
2. The feature has to be fully functional. In this case, the PR can be merged only if all the instructions (CP or distributed) have been implemented. However, the first version of the new backend need not perform faster than our existing alternate backends.
3. Incremental addition, but with unit test cases that address quality and stability concerns. In this case, a PR can be merged if a subset of instructions has been implemented, along with a set of unit test cases suggested by our committers. The main benefit here is quick-feedback iterations from our committers, whereas the main drawback is an intermediate state where we don't fully support the given backend for certain scenarios.

If we decide to go with option 1 or 2, then potentially there will be a lot of code to review at the end, and ideally we should give our committers the opportunity to provide early review comments on the feature. This will mitigate the risk of having to re-implement the entire feature. The options here are:
A. Create a branch on https://github.com/apache/incubator-systemml. This allows people to collaborate, and it allows committers to look at the code.
B. Create a branch on a fork and have a PR up, to allow committers to raise concerns and provide suggestions. This is done for https://github.com/apache/incubator-systemml/pull/165 and https://github.com/apache/incubator-systemml/pull/119. To collaborate, the person creating the PR will act as committer for the feature, will accept PRs on their branch, and will be responsible for resolving conflicts and keeping the PR in sync with master.

If we decide to go with option 3 (i.e., incremental addition), option B seems to be the logical choice, as we already do this for other features.

My goal here is not to create a formal process but instead to avoid any potential misunderstanding/confusion and also to follow recommended Apache practices.

Please email back with your thoughts :)

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
From: Deron Eriksson
To: dev@systemml.incubator.apache.org
Date: 05/18/2016 11:22 AM
Subject: Re: Discussion on GPU backend

Hi Niketan,

Good idea, I think that would be the cleanest solution for now. Since JCuda doesn't appear to be in a public maven repo, it adds a layer of difficulty to clean integration via maven builds.

Deron

On Wed, May 18, 2016 at 10:55 AM, Niketan Pansare wrote:

> Hi Deron,
>
> Good points. I vote that we keep JCuda and other accelerators we add as an
> external dependency. This means the user will have to ensure that JCuda.jar is
> in the classpath and that JCuda.dll/JCuda.so is in the LD_LIBRARY_PATH.
>
> I don't think JCuda.jar is platform-specific.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Deron Eriksson
> To: dev@systemml.incubator.apache.org
> Date: 05/18/2016 10:51 AM
> Subject: Re: Discussion on GPU backend
>
> Hi,
>
> I'm wondering what would be a good way to handle JCuda in terms of the
> build release packages. Currently we have 11 artifacts that we are
> building:
>   systemml-0.10.0-incubating-SNAPSHOT-inmemory.jar
>   systemml-0.10.0-incubating-SNAPSHOT-javadoc.jar
>   systemml-0.10.0-incubating-SNAPSHOT-sources.jar
>   systemml-0.10.0-incubating-SNAPSHOT-src.tar.gz
>   systemml-0.10.0-incubating-SNAPSHOT-src.zip
>   systemml-0.10.0-incubating-SNAPSHOT-standalone.jar
>   systemml-0.10.0-incubating-SNAPSHOT-standalone.tar.gz
>   systemml-0.10.0-incubating-SNAPSHOT-standalone.zip
>   systemml-0.10.0-incubating-SNAPSHOT.jar
>   systemml-0.10.0-incubating-SNAPSHOT.tar.gz
>   systemml-0.10.0-incubating-SNAPSHOT.zip
>
> It looks like JCuda is platform-specific, so you typically need different
> jars/dlls/sos/etc. for each platform. If I'm understanding things correctly,
> if we generated Windows/Linux/LinuxPowerPC/MacOS-specific SystemML
> artifacts for JCuda, we'd potentially have an enormous number of artifacts.
>
> Is this something that could potentially be handled by specific profiles in
> the pom, so that a user might be able to do something like "mvn clean
> package -P jcuda-windows" and be responsible for building
> the platform-specific SystemML jar for jcuda? Or is this something that
> could be handled differently, by putting the platform-specific jcuda jar on
> the classpath and any dlls or other needed libraries on the path?
>
> Deron
>
> On Tue, May 17, 2016 at 10:50 PM, Niketan Pansare wrote:
>
> > Hi Luciano,
> >
> > Like all our backends, there is no change in the programming model. The
> > user submits a DML script and specifies whether she wants to use an
> > accelerator. Assuming that we compile the jcuda jars into SystemML.jar, the
> > user can use the GPU backend with the following command:
> > spark-submit --master yarn-client ... -f MyAlgo.dml -accelerator -exec hybrid_spark
> >
> > The user also needs to set LD_LIBRARY_PATH to point to the JCuda DLL or .so
> > files. Please see https://issues.apache.org/jira/browse/SPARK-1720 ...
> > For example, the user can add the following to spark-env.sh:
> > export LD_LIBRARY_PATH=<path to jcuda so>:$LD_LIBRARY_PATH
> >
> > The first version of the GPU backend will only accelerate CP. In this case, we
> > have four types of instructions:
> > 1. CP
> > 2. GPU (requires a GPU on the driver)
> > 3. SPARK
> > 4. MR
> >
> > Note, the first version will require the CUDA/JCuda dependency to be
> > installed on the driver only.
> >
> > The next version will accelerate our distributed instructions as well. In
> > this case, we will have six types of instructions:
> > 1. CP
> > 2. GPU
> > 3. SPARK
> > 4. MR
> > 5. SPARK-GPU (requires a GPU cluster)
> > 6. MR-GPU (requires a GPU cluster)
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> > From: Luciano Resende
> > To: dev@systemml.incubator.apache.org
> > Date: 05/17/2016 09:13 PM
> > Subject: Re: Discussion on GPU backend
> >
> > Great to see detailed information on this topic Niketan; I guess I
> > missed it when you posted it initially.
> >
> > Could you elaborate a little more on what the programming model is when
> > the user wants to leverage the GPU? Also, today I can submit a job to spark
> > using --jars and it will handle copying the dependencies to the worker
> > nodes. If my application wants to leverage the GPU, what extra dependencies
> > will be required on the worker nodes, and how are they going to be
> > installed/updated on the Spark cluster?
> >
> > On Tue, May 3, 2016 at 1:26 PM, Niketan Pansare wrote:
> >
> > > Hi all,
> > >
> > > I have updated the design document for our GPU backend in the JIRA:
> > > https://issues.apache.org/jira/browse/SYSTEMML-445. The implementation
> > > details are based on the prototype I created, which is available in PR
> > > https://github.com/apache/incubator-systemml/pull/131. Once we are done
> > > with the discussion, I can clean up and separate out the GPU backend in a
> > > separate PR for easier review :)
> > >
> > > Here are the key design points:
> > > A GPU backend would implement two abstract classes:
> > >    1. GPUContext
> > >    2. GPUObject
> > >
> > > The GPUContext is responsible for GPU memory management and gets call-backs
> > > from SystemML's bufferpool on the following methods:
> > >    1. void acquireRead(MatrixObject mo)
> > >    2. void acquireModify(MatrixObject mo)
> > >    3. void release(MatrixObject mo, boolean isGPUCopyModified)
> > >    4. void exportData(MatrixObject mo)
> > >    5. void evict(MatrixObject mo)
> > >
> > > A GPUObject (like RDDObject and BroadcastObject) is stored in the CacheableData
> > > object. It contains the following methods that are called back from the
> > > corresponding GPUContext:
> > >    1. void allocateMemoryOnDevice()
> > >    2. void deallocateMemoryOnDevice()
> > >    3. long getSizeOnDevice()
> > >    4. void copyFromHostToDevice()
> > >    5. void copyFromDeviceToHost()
> > >
> > > In the initial implementation, we will add JCudaContext and JCudaPointer
> > > that will extend the above abstract classes, respectively.
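Taken together, a minimal sketch of the two abstract classes listed above (method signatures as given in the design points; the class layout and the MatrixObject import path are assumptions, not the final SystemML code):

    import org.apache.sysml.runtime.controlprogram.caching.MatrixObject;

    // Sketch of the proposed GPU memory-management abstraction.
    public abstract class GPUContext {
        // Call-backs from SystemML's bufferpool:
        public abstract void acquireRead(MatrixObject mo);
        public abstract void acquireModify(MatrixObject mo);
        public abstract void release(MatrixObject mo, boolean isGPUCopyModified);
        public abstract void exportData(MatrixObject mo);
        public abstract void evict(MatrixObject mo);
    }

    // Sketch of the per-matrix device handle, stored in CacheableData
    // (like RDDObject and BroadcastObject) and called back from the
    // corresponding GPUContext.
    abstract class GPUObject {
        abstract void allocateMemoryOnDevice();
        abstract void deallocateMemoryOnDevice();
        abstract long getSizeOnDevice();
        abstract void copyFromHostToDevice();
        abstract void copyFromDeviceToHost();
    }

A JCuda-based backend would then supply the JCudaContext and JCudaPointer subclasses described next.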
> > > The JCudaContext will be created by the ExecutionContextFactory depending on the
> > > user-specified accelerator. Analogous to MR/SPARK/CP, we will add a new
> > > ExecType, GPU, and implement GPU instructions.
> > >
> > > The above design is general enough that other people can implement
> > > custom accelerators (for example, OpenCL), and it also follows the design
> > > principles of our CP bufferpool.
> > >
> > > Thanks,
> > >
> > > Niketan Pansare
> > > IBM Almaden Research Center
> > > E-mail: npansar At us.ibm.com
> > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> > --
> > Luciano Resende
> > http://twitter.com/lresende1975
> > http://lresende.blogspot.com/