systemml-dev mailing list archives

From Matthias Boehm <>
Subject Re: Performance differences between SystemML LibMatrixMult and Breeze with native BLAS
Date Thu, 01 Dec 2016 00:00:48 GMT
OK, then let's sort this out one by one.

1) Benchmarks: There are a couple of things we should be aware of for 
these native/Java benchmarks. First, please set k to the number of 
logical cores on your machine and use a sufficiently large heap with 
Xms=Xmx and Xmn=0.1*Xmx. Second, exclude from the measurements the 
initial warmup runs for JIT compilation and any outliers where GC happened.

2) Breeze Comparison: Please also get the Breeze numbers without native 
BLAS libraries as another baseline on a comparable runtime platform.

3) Bigger Picture: Just to clarify the overall question here - of course 
native BLAS libraries are expected to be faster for square (or similar) 
dense matrix multiply, as current JDKs usually emit only scalar and 
no packed SIMD instructions for these operations. How much faster depends 
on the architecture. On older architectures with 128-bit and 256-bit vector 
units, it was not too problematic. But the trend toward wider vector units 
continues and hence it is worth thinking about if nothing happens on the 
JDK front. The reasons why we decided for platform independence in the 
past were as follows:

(a) Square dense matrix multiply is not a common operation (other than 
in DL). Much more common are memory-bandwidth-bound matrix-vector 
multiplications, and there copying your data out to a native library 
actually leads to a 3x slowdown.
(b) In end-to-end algorithms, especially in large-scale scenarios, we 
often see other factors dominating performance.
(c) Keeping the build and deployment simple, without the dependency on 
native libraries, was the logical conclusion given (a) and (b).
(d) There are also workarounds: A user can always define an external 
function and call whatever library she wants from there (we did this in 
the past with certain LAPACK functions).
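
For illustration, the measurement protocol from point 1) could be sketched 
roughly as follows. This is only a minimal sketch under stated assumptions: 
the class and method names (MatMultBench, matMult, timeRuns, median) are 
hypothetical and not part of SystemML, the naive single-threaded multiply is 
just a stand-in for the operation under test, and the warmup/run counts and 
the heap sizes in the comment are example values. It discards the warmup 
runs and reports the median, which is less sensitive to GC outliers than 
the mean.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical harness, not SystemML code. Example JVM flags (assumed sizes,
// following Xms=Xmx and Xmn=0.1*Xmx): java -Xms4g -Xmx4g -Xmn400m MatMultBench
public class MatMultBench {

    /** Naive dense multiply C = A * B on row-major arrays; A is m x n, B is n x p. */
    static double[] matMult(double[] a, double[] b, int m, int n, int p) {
        double[] c = new double[m * p];
        for (int i = 0; i < m; i++) {
            for (int l = 0; l < n; l++) {
                double aval = a[i * n + l];
                for (int j = 0; j < p; j++) {
                    c[i * p + j] += aval * b[l * p + j];
                }
            }
        }
        return c;
    }

    /** Runs op (warmup + runs) times; returns wall-clock ms of the last `runs` only. */
    static double[] timeRuns(Runnable op, int warmup, int runs) {
        double[] ms = new double[runs];
        for (int r = 0; r < warmup + runs; r++) {
            long t0 = System.nanoTime();
            op.run();
            long t1 = System.nanoTime();
            if (r >= warmup) {
                ms[r - warmup] = (t1 - t0) / 1e6; // keep only post-warmup timings
            }
        }
        return ms;
    }

    /** Median of the measurements (does not modify the input array). */
    static double median(double[] xs) {
        double[] s = xs.clone();
        Arrays.sort(s);
        int mid = s.length / 2;
        return (s.length % 2 == 1) ? s[mid] : 0.5 * (s[mid - 1] + s[mid]);
    }

    public static void main(String[] args) {
        int dim = 1000; // 1000 x 1000 dense inputs, as in the thread
        Random rnd = new Random(7);
        double[] a = rnd.doubles((long) dim * dim).toArray();
        double[] b = rnd.doubles((long) dim * dim).toArray();
        double[][] sink = new double[1][]; // keep the result reachable

        // The number of logical cores is what point 1) suggests using for k.
        System.out.println("logical cores: " + Runtime.getRuntime().availableProcessors());

        double[] ms = timeRuns(() -> sink[0] = matMult(a, b, dim, dim, dim), 10, 40);
        System.out.printf("median over %d runs: %.2f ms%n", ms.length, median(ms));
    }
}
```

The same harness could wrap both the Java and the native code path, so that 
both are measured with identical warmup and outlier handling.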


On 12/1/2016 12:27 AM, wrote:
> This is the printout from 50 iterations with the timings uncommented:
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 465.897145
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 389.913848
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 426.539142
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 391.878792
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 349.830464
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 284.751495
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 337.790165
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 363.655144
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 334.348717
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 745.822571
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 1257.83537
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 313.253455
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 268.226473
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 252.079117
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.162898
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 257.962804
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 279.462628
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.553724
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.316559
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.755306
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.528604
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.022494
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.964251
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 246.011221
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 309.174575
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.311429
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.97415
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 256.096419
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.975642
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.577342
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 287.840992
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.495411
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 253.541925
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.485217
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.114958
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.231448
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.012622
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 267.912608
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 264.265422
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 276.937746
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 261.649393
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.334056
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 258.506884
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 243.960491
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.801208
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 271.235477
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 275.290229
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.290325
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 265.851277
> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.902494
> On 01.12.2016 00:08, Matthias Boehm wrote:
>> Could you please make sure you're comparing the right thing? Even on
>> old Sandy Bridge CPUs our matrix mult for 1kx1k usually takes 40-50ms.
>> We also did the same experiments with larger matrices and SystemML was
>> about 2x faster compared to Breeze. Please uncomment the timings in
>> LibMatrixMult.matrixMult and double check the timing as well as that
>> we're actually comparing dense matrix multiply.
>> Regards,
>> Matthias
>> On 11/30/2016 11:54 PM, wrote:
>>> Hi all,
>>> I have run a very quick comparison between SystemML's LibMatrixMult and
>>> Breeze matrix multiplication using native BLAS (OpenBLAS through
>>> netlib-java). In this very small comparison, I see a performance
>>> difference for dense-dense matrices of size 1000 x 1000 (our default
>>> block size), with Breeze being about 5-6 times faster here. The code I
>>> used can be found here:
>>> Running this code with 50 iterations each gives me for example average
>>> times of:
>>> Breeze:         49.74 ms
>>> SystemML:   363.44 ms
>>> I don't want to say this is true for every operation, but those results
>>> let us form the hypothesis that native BLAS operations can lead to a
>>> significant speedup for certain operations which is worth testing with
>>> more advanced benchmarks.
>>> Btw: I am definitely not saying we should use Breeze here. I am more
>>> looking at native BLAS and LAPACK implementations in general (as
>>> provided by OpenBLAS, MKL, etc.).
>>> Let me know what you think!
>>> Felix
