No problem. It was a big headache for my team as well. One of us already reimplemented it from
scratch, as seen in this pending PR for our project. https://github.com/hailis/hail/pull/1895
Hopefully you find that useful. We'll hopefully try to PR that into Spark at some point.
Best,
John
Sent from my iPhone
> On Jun 14, 2017, at 8:28 PM, Anthony Thomas <ahthomas@eng.ucsd.edu> wrote:
>
> Interesting, thanks! That probably also explains why there seems to be a ton of shuffle
for this operation. So what's the best option for truly scalable matrix multiplication on
Spark then  implementing from scratch using the coordinate matrix ((i,j), k) format?
>
>> On Wed, Jun 14, 2017 at 4:29 PM, John Compitello <johnc@broadinstitute.org>
wrote:
>> Hey Anthony,
>>
>> You're the first person besides myself I've seen mention this. BlockMatrix multiply
is not the best method. As far as me and my team can tell, the memory problem stems from the
fact that when Spark tries to compute block (i, j) of the matrix, it tries to manifest all
of row i from matrix 1 and all of column j from matrix 2 in memory at once on one executor.
Then after doing that, it proceeds to combine them with a functional reduce, creating one
additional block for each pair. So you end up manifesting 3n + logn matrix blocks in memory
at once, which is why it sucks so much.
>>
>> Sent from my iPhone
>>
>>> On Jun 14, 2017, at 7:07 PM, Anthony Thomas <ahthomas@eng.ucsd.edu> wrote:
>>>
>>> I've been experimenting with MlLib's BlockMatrix for distributed matrix multiplication
but consistently run into problems with executors being killed due to memory constrains. The
linked gist (here) has a short example of multiplying a 25,000 x 25,000 square matrix taking
approximately 5G of disk with a vector (also stored as a BlockMatrix). I am running this on
a 3 node (1 master + 2 workers) cluster on Amazon EMR using the m4.xlarge instance type. Each
instance has 16GB of RAM and 4 CPU. The gist has detailed information about the Spark environment.
>>>
>>> I have tried reducing the block size of the matrix, increasing the number of
partitions in the underlying RDD, increasing defaultParallelism and increasing spark.yarn.executor.memoryOverhead
(up to 3GB)  all without success. The input matrix should fit comfortably in distributed
memory and the resulting matrix should be quite small (25,000 x 1) so I'm confused as to why
Spark seems to want so much memory for this operation, and why Spark isn't spilling to disk
here if it wants more memory. The job does eventually complete successfully, but for larger
matrices stages have to be repeated several times which leads to long run times. I don't encounter
any issues if I reduce the matrix size down to about 3GB. Can anyone with experience using
MLLib's matrix operators provide any suggestions about what settings to look at, or what the
hard constraints on memory for BlockMatrix multiplication are?
>>>
>>> Thanks,
>>>
>>> Anthony
>
