spark-dev mailing list archives

From "Evan R. Sparks" <evan.spa...@gmail.com>
Subject Re: Using CUDA within Spark / boosting linear algebra
Date Thu, 26 Mar 2015 16:07:09 GMT
Alright Sam - you are the expert here. If the GPL issues are unavoidable,
that's fine - what is the exact bit of code that is GPL?

The suggestion to use OpenBLAS is not to say it's the best option, but that
it's a *free, reasonable default* for many users - keep in mind the most
common deployment for Spark/MLlib is on 64-bit linux on EC2[1].
Additionally, for many of the problems we're targeting, this reasonable
default can provide a 1-2 orders of magnitude improvement in performance
over the f2jblas implementation that netlib-java falls back on.
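
As a quick sanity check, netlib-java can report which implementation it
actually loaded - a minimal Scala sketch using netlib-java's public
BLAS.getInstance() API (the class names in the comment are the ones I'd
expect to see, but verify on your own build):

  import com.github.fommil.netlib.BLAS

  // Prints e.g. com.github.fommil.netlib.NativeSystemBLAS (machine-optimised
  // natives) or com.github.fommil.netlib.F2jBLAS (the pure-Java fallback
  // mentioned above).
  println(s"BLAS in use: ${BLAS.getInstance().getClass.getName}")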

The JVM issues are trickier, I agree - so it sounds like a good user guide
explaining the tradeoffs and configuration procedures as they relate to
Spark is a reasonable way forward.

[1] -
https://gigaom.com/2015/01/27/a-few-interesting-numbers-about-apache-spark/

On Thu, Mar 26, 2015 at 12:54 AM, Sam Halliday <sam.halliday@gmail.com>
wrote:

> Btw, OpenBLAS requires GPL runtime binaries which are typically considered
> "system libraries" (and these fall under something similar to the Java
> classpath exception rule)... so it's basically impossible to distribute
> OpenBLAS the way you're suggesting, sorry. Indeed, there is work ongoing in
> Spark right now to clear up something of this nature.
>
> On a more technical level, I'd recommend watching my talk at ScalaX which
> explains in detail why high performance only comes from machine optimised
> binaries, which requires DevOps buy-in (and, I'd recommend using MKL anyway
> on the CPU, not OpenBLAS).
>
> On an even deeper level, using natives has consequences for JIT and GC
> which aren't suitable for everybody, and we'd really like people to go into
> that with their eyes wide open.
> On 26 Mar 2015 07:43, "Sam Halliday" <sam.halliday@gmail.com> wrote:
>
>> I'm not at all surprised ;-) I fully expect the GPU performance to get
>> better automatically as the hardware improves.
>>
>> Netlib natives still need to be shipped separately. I'd also oppose any
>> move to make OpenBLAS the default - it's not always better, and I think
>> natives really need DevOps buy-in. It's not the right solution for
>> everybody.
>> On 26 Mar 2015 01:23, "Evan R. Sparks" <evan.sparks@gmail.com> wrote:
>>
>>> Yeah, much more reasonable - nice to know that we can get full GPU
>>> performance from breeze/netlib-java - meaning there's no compelling
>>> performance reason to switch out our current linear algebra library (at
>>> least as far as this benchmark is concerned).
>>>
>>> Instead, it looks like a user guide for configuring Spark/MLlib to use
>>> the right BLAS library will get us most of the way there. Or, would it make
>>> sense to finally ship openblas compiled for some common platforms (64-bit
>>> linux, windows, mac) directly with Spark - hopefully eliminating the jblas
>>> warnings once and for all for most users? (Licensing is BSD) Or am I
>>> missing something?
>>>
>>> On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <
>>> alexander.ulanov@hp.com> wrote:
>>>
>>>> As everyone suggested, the results were too good to be true, so I
>>>> double-checked them. It turns out that nvblas did not do the multiplication,
>>>> due to the NVBLAS_TILE_DIM parameter in "nvblas.conf", and returned a zero
>>>> matrix. My previously posted results with nvblas measure matrix copying only.
>>>> The default NVBLAS_TILE_DIM==2048 is too big for my graphics card/matrix size. I
>>>> handpicked other values that worked. As a result, netlib+nvblas is on par
>>>> with BIDMat-cuda. As promised, I am going to post a how-to for nvblas
>>>> configuration.
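>>>>
>>>> In the meantime, here is the shape of the configuration - an illustrative
>>>> nvblas.conf sketch only; the key names follow NVIDIA's nvblas documentation,
>>>> while the CPU BLAS path and tile size below are machine-specific assumptions
>>>> you will need to adapt:
>>>>
>>>>   # nvblas.conf - illustrative sketch, not an exact working file
>>>>   NVBLAS_LOGFILE       nvblas.log
>>>>   # CPU BLAS used for calls nvblas does not route to the GPU
>>>>   # (path is an assumption - point it at your own libblas)
>>>>   NVBLAS_CPU_BLAS_LIB  /usr/lib64/libopenblas.so
>>>>   NVBLAS_GPU_LIST      ALL
>>>>   # The 2048 default returned a zero matrix on my card; smaller values
>>>>   # worked (1024 here is illustrative - handpick for your GPU)
>>>>   NVBLAS_TILE_DIM      1024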
>>>>
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Ulanov, Alexander
>>>> Sent: Wednesday, March 25, 2015 2:31 PM
>>>> To: Sam Halliday
>>>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R.
>>>> Sparks; jfcanny
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi again,
>>>>
>>>> I finally managed to use nvblas within Spark+netlib-java. It has
>>>> exceptional performance for big matrices with Double, faster than
>>>> BIDMat-cuda with Float. But for smaller matrices, if you have to copy them
>>>> to/from the GPU, OpenBlas or MKL might be a better choice. This correlates with
>>>> the original nvblas presentation at GPU conf 2013 (slide 21):
>>>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>>>
>>>> My results:
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> Just in case: these tests are not meant to generalize the performance of
>>>> the different libraries. I just want to pick the library that does dense
>>>> matrix multiplication best for my task.
>>>>
>>>> P.S. My previous issue with nvblas was the following: it exposes Fortran
>>>> blas functions, while netlib-java uses C cblas functions. So, one needs a
>>>> cblas shared library to use nvblas through netlib-java. Fedora does not
>>>> ship cblas (Debian and Ubuntu do), so I needed to compile it. I could not
>>>> use the cblas from Atlas or Openblas because they link to their own
>>>> implementations and not to the Fortran blas.
>>>>
>>>> Best regards, Alexander
>>>>
>>>> -----Original Message-----
>>>> From: Ulanov, Alexander
>>>> Sent: Tuesday, March 24, 2015 6:57 PM
>>>> To: Sam Halliday
>>>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi,
>>>>
>>>> I am trying to use nvblas with netlib-java from Spark. nvblas functions
>>>> should replace the current blas function calls after setting LD_PRELOAD, as
>>>> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any
>>>> changes to netlib-java. It seems to work for a simple Java example, but I
>>>> cannot make it work with Spark. I run the following:
>>>>
>>>>   export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>>>>   env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
>>>>
>>>> In nvidia-smi I observe that Java is set up to use the GPU:
>>>>
>>>> +-----------------------------------------------------------------------------+
>>>> | Processes:                                                       GPU Memory |
>>>> |  GPU       PID  Type  Process name                               Usage      |
>>>> |=============================================================================|
>>>> |    0      8873    C   bash                                           39MiB  |
>>>> |    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java               39MiB  |
>>>> +-----------------------------------------------------------------------------+
>>>>
>>>> In the Spark shell I do matrix multiplication and see the following:
>>>>
>>>>   15/03/25 06:48:01 INFO JniLoader: successfully loaded
>>>>   /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>>>>
>>>> So I am sure that netlib-native is loaded and cblas is supposedly used.
>>>> However, matrix multiplication executes on the CPU, since I see 16% CPU
>>>> usage and 0% GPU usage. I also checked different matrix sizes, from
>>>> 100x100 to 12000x12000.
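>>>>
>>>> For completeness, this is the kind of check I run in the shell - a minimal
>>>> Breeze sketch (Breeze is bundled with Spark, so the import works in
>>>> spark-shell); while it runs, nvidia-smi should show GPU activity if nvblas
>>>> is actually intercepting the calls:
>>>>
>>>>   import breeze.linalg.DenseMatrix
>>>>
>>>>   // A GEMM large enough that the work is visible in nvidia-smi / top.
>>>>   val n = 8192
>>>>   val a = DenseMatrix.rand[Double](n, n)
>>>>   val b = DenseMatrix.rand[Double](n, n)
>>>>   val c = a * b     // routed through netlib-java's dgemm
>>>>   println(c(0, 0))  // use the result so nothing is optimised away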
>>>>
>>>> Could you suggest why LD_PRELOAD might not affect the Spark shell?
>>>>
>>>> Best regards, Alexander
>>>>
>>>>
>>>>
>>>> From: Sam Halliday [mailto:sam.halliday@gmail.com]
>>>> Sent: Monday, March 09, 2015 6:01 PM
>>>> To: Ulanov, Alexander
>>>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>>
>>>> Thanks so much for following up on this!
>>>>
>>>> Hmm, I wonder if we should have a concerted effort to chart performance
>>>> on various pieces of hardware...
>>>> On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ulanov@hp.com> wrote:
>>>> Hi Everyone, I've updated the benchmark as Xiangrui suggested: I added
>>>> a comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see
>>>> support for Double in the current source code) and did the test with BIDMat
>>>> and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>>>>
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> Best regards, Alexander
>>>>
>>>> -----Original Message-----
>>>> From: Sam Halliday [mailto:sam.halliday@gmail.com]
>>>> Sent: Tuesday, March 03, 2015 1:54 PM
>>>> To: Xiangrui Meng; Joseph Bradley
>>>> Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> BTW, is anybody on this list going to the London Meetup in a few weeks?
>>>>
>>>>
>>>> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>>>>
>>>> Would be nice to meet other people working on the guts of Spark! :-)
>>>>
>>>>
>>>> Xiangrui Meng <mengxr@gmail.com> writes:
>>>>
>>>> > Hey Alexander,
>>>> >
>>>> > I don't quite understand the part where netlib-cublas is about 20x
>>>> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
>>>> > with netlib-java?
>>>> >
>>>> > CC'ed Sam, the author of netlib-java.
>>>> >
>>>> > Best,
>>>> > Xiangrui
>>>> >
>>>> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <
>>>> > joseph@databricks.com> wrote:
>>>> >> Better documentation for linking would be very helpful!  Here's a JIRA:
>>>> >> https://issues.apache.org/jira/browse/SPARK-6019
>>>> >>
>>>> >>
>>>> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
>>>> >> <evan.sparks@gmail.com> wrote:
>>>> >>
>>>> >>> Thanks for compiling all the data and running these benchmarks,
>>>> >>> Alex. The big takeaways here can be seen with this chart:
>>>> >>>
>>>> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>> >>>
>>>> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>>> >>> BIDMat+GPU) can provide a substantial (but less than an order of
>>>> >>> magnitude) benefit over a well-tuned CPU implementation (e.g.
>>>> >>> BIDMat+MKL or netlib-java+openblas-compiled).
>>>> >>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref)
>>>> >>> can be 1-2 orders of magnitude worse than a well-tuned CPU
>>>> >>> implementation, particularly for larger matrices. This is not to pick
>>>> >>> on netlib - this basically agrees with the author's own benchmarks (
>>>> >>> https://github.com/fommil/netlib-java)
>>>> >>>
>>>> >>> I think that most of our users are in a situation where using GPUs
>>>> >>> may not be practical - although we could consider having a good GPU
>>>> >>> backend available as an option. However, *ALL* users of MLlib could
>>>> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
>>>> >>> BLAS implementation. Perhaps we should consider updating the MLlib
>>>> >>> guide with a more complete section for enabling high performance
>>>> >>> binaries on OSX and Linux? Or better, figure out a way for the
>>>> >>> system to fetch these automatically.
>>>> >>>
>>>> >>> - Evan
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>>>> >>> alexander.ulanov@hp.com> wrote:
>>>> >>>
>>>> >>>> Just to summarize this thread, I was finally able to make all the
>>>> >>>> performance comparisons that we discussed. It turns out that:
>>>> >>>>
>>>> >>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
>>>> >>>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>>> >>>>
>>>> >>>> Below is the link to the spreadsheet with full results.
>>>> >>>>
>>>> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>> >>>>
>>>> >>>> One thing still needs exploration: does BIDMat-cublas perform
>>>> >>>> copying to/from the machine’s RAM?
>>>> >>>>
>>>> >>>> -----Original Message-----
>>>> >>>> From: Ulanov, Alexander
>>>> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>> >>>> To: Evan R. Sparks
>>>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though
>>>> >>>> the original one discusses a slightly different topic. I was able to
>>>> >>>> link netlib with the MKL from the BIDMat binaries. Indeed, MKL is
>>>> >>>> statically linked inside a 60MB library.
>>>> >>>>
>>>> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>>>> >>>> +------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>>>> >>>> |100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>>>> >>>> |1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>>>> >>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>>>> >>>>
>>>> >>>> It turns out that the pre-compiled MKL is faster than the precompiled
>>>> >>>> OpenBlas on my machine. Probably I’ll add two more columns with
>>>> >>>> locally compiled openblas and cuda.
>>>> >>>>
>>>> >>>> Alexander
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
>>>> >>>> Sent: Monday, February 09, 2015 6:06 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> Great - perhaps we can move this discussion off-list and onto a
>>>> >>>> JIRA ticket? (Here's one:
>>>> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
>>>> >>>>
>>>> >>>> It seems like this is going to be somewhat exploratory for a while
>>>> >>>> (and there's probably only a handful of us who really care about
>>>> >>>> fast linear algebra!)
>>>> >>>>
>>>> >>>> - Evan
>>>> >>>>
>>>> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>>>> >>>> alexander.ulanov@hp.com> wrote:
>>>> >>>> Hi Evan,
>>>> >>>>
>>>> >>>> Thank you for the explanation and the useful link. I am going to build
>>>> >>>> OpenBLAS, link it with Netlib-java, and run the benchmark again.
>>>> >>>>
>>>> >>>> Do I understand correctly that the BIDMat binaries contain a statically
>>>> >>>> linked Intel MKL BLAS? That might be the reason why I am able to run
>>>> >>>> BIDMat without having MKL BLAS installed on my server. If that is true, I
>>>> >>>> wonder whether it is OK, because Intel sells this library. Nevertheless,
>>>> >>>> it seems that in my case the precompiled MKL BLAS performs better than
>>>> >>>> the precompiled OpenBLAS, given that BIDMat and Netlib-java are
>>>> >>>> supposed to be on par in JNI overheads.
>>>> >>>>
>>>> >>>> Still, it might be interesting to link Netlib-java with Intel MKL,
>>>> >>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam
>>>> >>>> Halliday (Netlib-java) would be interested in comparing their libraries.
>>>> >>>>
>>>> >>>> Best regards, Alexander
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
>>>> >>>> Sent: Friday, February 06, 2015 5:58 PM
>>>> >>>>
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
>>>> >>>> from getting cache sizes, etc. set up correctly for your particular
>>>> >>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
>>>> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>>>> >>>> quickly and yields performance competitive with MKL.
>>>> >>>>
>>>> >>>> To make sure the right library is getting used, you have to make
>>>> >>>> sure it's first on the search path - export
>>>> >>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
>>>> >>>>
>>>> >>>> For some examples of getting netlib-java set up on an EC2 node and
>>>> >>>> some example benchmarking code we ran a while back, see:
>>>> >>>> https://github.com/shivaram/matrix-bench
>>>> >>>>
>>>> >>>> In particular - build-openblas-ec2.sh shows you how to build the
>>>> >>>> library and set up the symlinks correctly, and scala/run-netlib.sh
>>>> >>>> shows you how to get the path set up and get that library picked up
>>>> >>>> by netlib-java.
>>>> >>>>
>>>> >>>> In this way - you could probably get cuBLAS set up to be used by
>>>> >>>> netlib-java as well.
>>>> >>>>
>>>> >>>> - Evan
>>>> >>>>
>>>> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>>>> >>>> alexander.ulanov@hp.com> wrote:
>>>> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>>>> >>>> load the right blas? For netlib, there are a few JVM flags, such as
>>>> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>>>> >>>> so I can force it to use the Java implementation. I am not sure I
>>>> >>>> understand how to force the use of a specific blas (not a specific
>>>> >>>> wrapper for blas).
>>>> >>>>
>>>> >>>> Btw, I have installed openblas (yum install openblas), so I suppose
>>>> >>>> that netlib is using it.
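>>>> >>>>
>>>> >>>> For what it's worth, the same switch can be exercised from code - a
>>>> >>>> minimal sketch, assuming the property is set before any netlib-java
>>>> >>>> class is first touched (netlib-java picks its implementation at class
>>>> >>>> initialisation):
>>>> >>>>
>>>> >>>>   // Must run before the first use of com.github.fommil.netlib.BLAS,
>>>> >>>>   // otherwise the implementation has already been selected.
>>>> >>>>   System.setProperty("com.github.fommil.netlib.BLAS",
>>>> >>>>     "com.github.fommil.netlib.F2jBLAS")
>>>> >>>>
>>>> >>>>   import com.github.fommil.netlib.BLAS
>>>> >>>>   println(BLAS.getInstance().getClass.getName)
>>>> >>>>   // expected: com.github.fommil.netlib.F2jBLAS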
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
>>>> >>>> Sent: Friday, February 06, 2015 5:19 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> >>>>
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> Getting breeze to pick up the right blas library is critical for
>>>> >>>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
>>>> >>>> It might make sense to force BIDMat to use the same underlying BLAS
>>>> >>>> library as well.
>>>> >>>>
>>>> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>>>> >>>> alexander.ulanov@hp.com> wrote:
>>>> >>>> Hi Evan, Joseph
>>>> >>>>
>>>> >>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
>>>> >>>> faster than netlib-java+breeze (sorry for the weird table formatting):
>>>> >>>>
>>>> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>>>> >>>> +------------------------+-------------+-----------------------------------------------+----------------------------+
>>>> >>>> |100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
>>>> >>>> |1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
>>>> >>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>>>> >>>>
>>>> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>>>> >>>> 19 Linux, Scala 2.11.
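>>>> >>>>
>>>> >>>> The timings are simple wall-clock measurements - a minimal sketch of
>>>> >>>> the kind of harness used (the warm-up pass and the choice of sizes are
>>>> >>>> my own assumptions, not a rigorous benchmark):
>>>> >>>>
>>>> >>>>   import breeze.linalg.DenseMatrix
>>>> >>>>
>>>> >>>>   def timeGemm(n: Int): Unit = {
>>>> >>>>     val a = DenseMatrix.rand[Double](n, n)
>>>> >>>>     val b = DenseMatrix.rand[Double](n, n)
>>>> >>>>     a * b                          // warm-up: JIT + native library load
>>>> >>>>     val start = System.nanoTime()
>>>> >>>>     val c = a * b                  // the measured dgemm call
>>>> >>>>     val secs = (System.nanoTime() - start) / 1e9
>>>> >>>>     println(s"${n}x$n * ${n}x$n: $secs s (c(0,0) = ${c(0, 0)})")
>>>> >>>>   }
>>>> >>>>
>>>> >>>>   Seq(100, 1000, 10000).foreach(timeGemm)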
>>>> >>>>
>>>> >>>> Later I will run tests with Cuda. I need to install a new Cuda
>>>> >>>> version for this purpose.
>>>> >>>>
>>>> >>>> Do you have any ideas why breeze-netlib with native blas is so much
>>>> >>>> slower than BIDMat MKL?
>>>> >>>>
>>>> >>>> Best regards, Alexander
>>>> >>>>
>>>> >>>> From: Joseph Bradley [mailto:joseph@databricks.com]
>>>> >>>> Sent: Thursday, February 05, 2015 5:29 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: Evan R. Sparks; dev@spark.apache.org
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> Hi Alexander,
>>>> >>>>
>>>> >>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> >>>> Concerning your question earlier about keeping data stored on the
>>>> >>>> GPU rather than having to move it between main memory and GPU
>>>> >>>> memory on each iteration, I would guess this would be critical to
>>>> >>>> getting good performance.  If you could do multiple local
>>>> >>>> iterations before aggregating results, then the cost of data
>>>> >>>> movement to the GPU could be amortized (and I believe that is done
>>>> >>>> in practice).  Having Spark be aware of the GPU and using it as
>>>> >>>> another part of memory sounds like a much bigger undertaking.
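>>>> >>>>
>>>> >>>> To make the amortization idea concrete, here is a rough sketch of the
>>>> >>>> pattern - upload/download/gpuStep are hypothetical placeholders for
>>>> >>>> whatever GPU binding ends up being used, not a real API:
>>>> >>>>
>>>> >>>>   // Hypothetical device handle and operations - placeholders only.
>>>> >>>>   trait GpuMatrix
>>>> >>>>   def upload(host: Array[Double]): GpuMatrix = ???
>>>> >>>>   def download(dev: GpuMatrix): Array[Double] = ???
>>>> >>>>   def gpuStep(w: GpuMatrix, data: GpuMatrix): GpuMatrix = ???
>>>> >>>>
>>>> >>>>   def localIterations(w0: Array[Double], data: Array[Double], k: Int): Array[Double] = {
>>>> >>>>     val devData = upload(data)   // pay the host-to-GPU transfer once...
>>>> >>>>     var devW    = upload(w0)
>>>> >>>>     for (_ <- 1 to k)            // ...amortized over k local iterations
>>>> >>>>       devW = gpuStep(devW, devData)
>>>> >>>>     download(devW)               // one transfer back before aggregation
>>>> >>>>   }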
>>>> >>>>
>>>> >>>> Joseph
>>>> >>>>
>>>> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>>>> >>>> alexander.ulanov@hp.com> wrote:
>>>> >>>> Thank you for the explanation! I’ve watched the BIDMach presentation by
>>>> >>>> John Canny and I am really inspired by his talk and the comparisons
>>>> >>>> with Spark MLlib.
>>>> >>>>
>>>> >>>> I am very interested to find out what will be better within Spark:
>>>> >>>> BIDMat, or netlib-java with CPU or GPU natives. Could you suggest a
>>>> >>>> fair way to benchmark them? Currently I do benchmarks on artificial
>>>> >>>> neural networks in batch mode. While it is not a “pure” test of
>>>> >>>> linear algebra, it involves some other things that are essential
>>>> >>>> to machine learning.
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
>>>> >>>> Sent: Thursday, February 05, 2015 1:29 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: dev@spark.apache.org
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>>>> >>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
>>>> >>>> data layout and fewer levels of indirection - it's definitely a
>>>> >>>> worthwhile experiment to run. The main speedups I've seen from
>>>> >>>> using it come from highly optimized GPU code for linear algebra. I
>>>> >>>> know that in the past Canny has gone as far as to write custom GPU
>>>> >>>> kernels for performance-critical regions of code.[1]
>>>> >>>>
>>>> >>>> BIDMach is highly optimized for single node performance or
>>>> >>>> performance on small clusters.[2] Once data doesn't fit easily in
>>>> >>>> GPU memory (or can't be batched in that way), the performance tends to
>>>> >>>> fall off. Canny argues for hardware/software codesign and as such
>>>> >>>> prefers machine configurations that are quite different from what
>>>> >>>> we find in most commodity cluster nodes - e.g. 10 disk channels
>>>> >>>> and 4 GPUs.
>>>> >>>>
>>>> >>>> In contrast, MLlib was designed for horizontal scalability on
>>>> >>>> commodity clusters and works best on very big datasets - on the
>>>> >>>> order of terabytes.
>>>> >>>>
>>>> >>>> For the most part, these projects were developed concurrently to
>>>> >>>> address slightly different use cases. That said, there may be bits of
>>>> >>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
>>>> >>>> careful about maintaining cross-language compatibility for our Java
>>>> >>>> and Python users, though.
>>>> >>>>
>>>> >>>> - Evan
>>>> >>>>
>>>> >>>> [1] - http://arxiv.org/abs/1409.5402
>>>> >>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>> >>>>
>>>> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>>>> >>>> alexander.ulanov@hp.com> wrote:
>>>> >>>> Hi Evan,
>>>> >>>>
>>>> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
>>>> >>>> you know what makes it faster than netlib-java?
>>>> >>>>
>>>> >>>> The same group has the BIDMach library that implements machine
>>>> >>>> learning. For some examples they use the Caffe convolutional neural
>>>> >>>> network library owned by another group in Berkeley. Could you
>>>> >>>> elaborate on how these all might be connected with Spark MLlib? If
>>>> >>>> you take BIDMat for linear algebra, why don’t you take BIDMach for
>>>> >>>> optimization and learning?
>>>> >>>>
>>>> >>>> Best regards, Alexander
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
>>>> >>>> Sent: Thursday, February 05, 2015 12:09 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: dev@spark.apache.org
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>>>> >>>> blas in many cases.
>>>> >>>>
>>>> >>>> You might consider taking a look at the codepaths that BIDMat (
>>>> >>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>>> >>>> netlib-java/breeze. John Canny et al. have done a bunch of work
>>>> >>>> optimizing to make this work really fast from Scala. I've run it on
>>>> >>>> my laptop and compared to MKL, and in certain cases it's 10x faster
>>>> >>>> at matrix multiply. There are a lot of layers of indirection here
>>>> >>>> and you really want to avoid data copying as much as possible.
>>>> >>>>
>>>> >>>> We could also consider swapping out Breeze for BIDMat, but that
>>>> >>>> would be a big project, and if we can figure out how to get
>>>> >>>> breeze+cublas to comparable performance that would be a big win.
>>>> >>>>
>>>> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>>>> >>>> alexander.ulanov@hp.com> wrote:
>>>> >>>> Dear Spark developers,
>>>> >>>>
>>>> >>>> I am exploring how to make linear algebra operations faster within Spark.
>>>> >>>> One way of doing this is to use the Scala Breeze library that is
>>>> >>>> bundled with Spark. For matrix operations, it employs Netlib-java,
>>>> >>>> which has a Java wrapper for BLAS (basic linear algebra subprograms)
>>>> >>>> and LAPACK native binaries if they are available on the worker
>>>> >>>> node. It also has its own optimized Java implementation of BLAS. It
>>>> >>>> is worth mentioning that native binaries provide better performance
>>>> >>>> only for BLAS level 3, i.e. matrix-matrix operations or general
>>>> >>>> matrix multiplication (GEMM). This is confirmed by the GEMM test on
>>>> >>>> the Netlib-java page https://github.com/fommil/netlib-java. I also
>>>> >>>> confirmed it with my experiments with training an artificial neural
>>>> >>>> network: https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>>> >>>> However, I would like to boost performance more.
>>>> >>>>
>>>> >>>> GPUs are supposed to be fast at linear algebra, and there is an
>>>> >>>> Nvidia CUDA implementation of BLAS, called cublas. I have one Linux
>>>> >>>> server with an Nvidia GPU and I was able to do the following. I linked
>>>> >>>> cublas (instead of a cpu-based blas) with the Netlib-java wrapper and
>>>> >>>> put it into Spark, so Breeze/Netlib is using it. Then I did some
>>>> >>>> performance measurements with regard to artificial neural network
>>>> >>>> batch learning in Spark MLlib, which involves matrix-matrix
>>>> >>>> multiplications. It turns out that for matrices of size less than
>>>> >>>> ~1000x780, GPU cublas has the same speed as CPU blas, and cublas
>>>> >>>> becomes slower for bigger matrices. It's worth mentioning that it was
>>>> >>>> not a test of ONLY multiplication, since there are other operations
>>>> >>>> involved. One of the reasons for the slowdown might be the overhead of
>>>> >>>> copying the matrices from computer memory to graphics card memory and back.
>>>> >>>>
>>>> >>>> So, a few questions:
>>>> >>>> 1) Do these results with CUDA make sense?
>>>> >>>> 2) If the problem is the copy overhead, are there any libraries that
>>>> >>>> allow forcing intermediate results to stay in graphics card memory,
>>>> >>>> thus removing the overhead?
>>>> >>>> 3) Any other options to speed up linear algebra in Spark?
>>>> >>>>
>>>> >>>> Thank you, Alexander
>>>> >>>>
>>>> >>>
>>>>
>>>> --
>>>> Best regards,
>>>> Sam
>>>>
>>>
>>>
