Subject: Re: Performance differences between SystemML LibMatrixMult and Breeze with native BLAS
Date: Wed, 30 Nov 2016 23:27:55 GMT
This is the printout from 50 iterations with the timing output uncommented:

MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 465.897145
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 389.913848
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 426.539142
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 391.878792
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 349.830464
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 284.751495
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 337.790165
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 363.655144
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 334.348717
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 745.822571
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 1257.83537
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 313.253455
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 268.226473
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 252.079117
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.162898
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 257.962804
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 279.462628
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.553724
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.316559
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.755306
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.528604
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.022494
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.964251
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 246.011221
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 309.174575
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.311429
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.97415
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 256.096419
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.975642
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.577342
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 287.840992
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.495411
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 253.541925
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.485217
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.114958
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.231448
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.012622
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 267.912608
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 264.265422
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 276.937746
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 261.649393
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.334056
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 258.506884
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 243.960491
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.801208
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 271.235477
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 275.290229
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.290325
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 265.851277
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.902494
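
The first eleven iterations in the printout above are noticeably slower (about 285 to 1258) than the later ones (about 240 to 315), so an average over all 50 runs mixes warm-up with steady state. A minimal sketch for summarizing such a printout with and without the warm-up iterations follows. The function names `parse_timings` and `summarize` and the choice of how many iterations to discard are assumptions made for illustration and are not part of SystemML; the unit is presumably milliseconds, which the printout itself does not state.

```python
import re
from statistics import mean, median

# Matches lines such as
#   MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 465.897145
# and captures the thread count k and the trailing timing value.
LINE_RE = re.compile(r"^MM k=(\d+) .* in ([0-9.]+)$")


def parse_timings(log_text):
    """Return the timing of every matching 'MM k=...' line as a float."""
    timings = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            timings.append(float(match.group(2)))
    return timings


def summarize(timings, warmup=0):
    """Summarize timings after discarding the first `warmup` iterations."""
    steady = timings[warmup:]
    return {
        "n": len(steady),
        "mean": mean(steady),
        "median": median(steady),
        "min": min(steady),
        "max": max(steady),
    }


# Four lines copied from the printout, used here as a small sample.
sample = """\
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 465.897145
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 1257.83537
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 252.079117
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.162898
"""

if __name__ == "__main__":
    timings = parse_timings(sample)
    print("all runs:   ", summarize(timings))
    print("steady only:", summarize(timings, warmup=2))
```

Reporting the median next to the mean makes outliers such as the 1257.83 run less dominant in the summary.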

On 01.12.2016 00:08, Matthias Boehm wrote:
> Could you please make sure you're comparing the right thing? Even on
> old Sandy Bridge CPUs, our matrix mult for 1kx1k usually takes 40-50ms.
> We also did the same experiments with larger matrices, and SystemML was
> about 2x faster compared to Breeze. Please uncomment the timings in
> LibMatrixMult.matrixMult and double-check the timing, as well as that
> we're actually comparing dense matrix multiply.
> Regards,
> Matthias
> On 11/30/2016 11:54 PM, wrote:
>> Hi all,
>> I have run a very quick comparison between SystemML's LibMatrixMult and
>> Breeze matrix multiplication using native BLAS (OpenBLAS through
>> netlib-java). In this very small comparison, Breeze is about 5-6 times
>> faster for dense-dense matrices of size 1000 x 1000 (our default
>> blocksize). The code I used can be found here:
>> Running this code with 50 iterations each gives me, for example, average
>> times of:
>> Breeze:         49.74 ms
>> SystemML:   363.44 ms
>> I don't want to say this is true for every operation, but these results
>> suggest the hypothesis that native BLAS operations can lead to a
>> significant speedup for certain operations, which is worth testing with
>> more advanced benchmarks.
>> Btw: I am definitely not saying we should use Breeze here. I am rather
>> looking at native BLAS and LAPACK implementations in general (as
>> provided by OpenBLAS, MKL, etc.).
>> Let me know what you think!
>> Felix
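
One way to make a comparison like the one quoted above less sensitive to warm-up effects is to run a number of untimed iterations before measuring. The following is only a sketch of that idea, not the benchmark code referenced in the thread: the `benchmark` helper, its parameters, and the naive pure-Python `matmul` workload are all hypothetical stand-ins for the actual SystemML and Breeze calls.

```python
import time


def matmul(a, b):
    """Naive dense matrix multiply on lists of rows (stand-in workload only)."""
    cols_b = list(zip(*b))  # columns of b, so each dot product is a simple zip
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def benchmark(fn, iterations=50, warmup=10):
    """Run fn() `warmup` times untimed, then return `iterations` timings in ms."""
    for _ in range(warmup):
        fn()
    timings_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


if __name__ == "__main__":
    n = 50  # kept small because the pure-Python workload is slow
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = benchmark(lambda: matmul(a, b), iterations=5, warmup=2)
    print("avg ms:", sum(timings) / len(timings))
```

On a JVM the same pattern applies, with the warm-up runs giving the JIT compiler time to optimize the multiply kernel before any timings are recorded.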

