Hi guys,

I am trying to implement a block matrix-vector multiplication algorithm in Hadoop, following the scheme described in http://i.stanford.edu/~ullman/mmds/ch5.pdf on page 162. My matrix is going to be sparse and the vector dense, which is exactly the situation in PageRank as well. The vector is assumed not to fit in memory. Just to reiterate: I am not trying to implement PageRank itself; I want to implement the matrix-vector multiplication strategy described in the PDF.
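To make the blocking concrete, here is a minimal in-memory sketch of what a single (matrix block, vector stripe) pair contributes to the result; the class and method names are just mine for illustration, not from the paper:

    import java.util.Arrays;

    // Sketch of the blocking scheme: the matrix is cut into square blocks M[i][j] and the
    // vector into stripes v[j]; result stripe y[i] is the sum over j of M[i][j] * v[j].
    // This partial product is what each map task would compute for one (i, j) block pair.
    public class BlockMultiplySketch {

        // Add the contribution of one matrix block times one vector stripe to a result stripe.
        static void accumulate(double[][] matrixBlock, double[] vectorStripe, double[] resultStripe) {
            for (int r = 0; r < matrixBlock.length; r++) {
                for (int c = 0; c < vectorStripe.length; c++) {
                    resultStripe[r] += matrixBlock[r][c] * vectorStripe[c];
                }
            }
        }

        public static void main(String[] args) {
            double[][] block = { { 1, 2 }, { 3, 4 } };   // M[i][j], block size 2
            double[] stripe = { 10, 20 };                // v[j]
            double[] result = new double[2];             // y[i], accumulated across all j
            accumulate(block, stripe, result);
            System.out.println(Arrays.toString(result)); // [50.0, 110.0]
        }
    }

In the MapReduce version, the partial result stripes for the same row-block index i would then be summed in the combiner/reducer.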

The way I was thinking about a possible implementation is to use CompositeInputFormat and essentially perform a map-side join of a matrix block with a vector block, then sum up the intermediate result blocks that contribute to the same final result block in the combiner/reducer. The difference from a general map-side join is that I need to join several blocks of the matrix with a single block of the vector. I am aware of the requirements Hadoop places on a map-side join regarding splits, ordering, etc.; as far as splitting is concerned, I would take care of that by storing each block in its own file, named accordingly. It looks to me like I would need to modify the code that decides which files get joined, so that Hadoop no longer pairs files from the two input paths by identical names but according to my own scheme. However, I don't know whether that is a valid approach or where that part of the code lives. I looked into the Hadoop source but couldn't find it.
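Just to show where I'm starting from, this is roughly how I would wire up the job with the old "mapred" API (where CompositeInputFormat lives); the mapper/reducer classes are hypothetical placeholders, and the compose() expression shown here is the standard pairing behaviour that I believe I'd have to change:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class BlockJoinDriver {
        public static void main(String[] args) throws Exception {
            Path matrixBlocks = new Path(args[0]);   // one SequenceFile per matrix block M[i][j]
            Path vectorStripes = new Path(args[1]);  // one SequenceFile per vector stripe v[j]
            Path output = new Path(args[2]);

            JobConf conf = new JobConf(BlockJoinDriver.class);
            conf.setJobName("block matrix-vector multiplication");

            // Standard map-side join wiring: this expression is where Hadoop decides
            // which inputs from the two paths get paired up. For my scheme I would
            // need M[0][j], M[1][j], ... all paired with the single stripe v[j].
            conf.set("mapred.join.expr", CompositeInputFormat.compose(
                    "inner", SequenceFileInputFormat.class, matrixBlocks, vectorStripes));
            conf.setInputFormat(CompositeInputFormat.class);

            // Hypothetical classes: the mapper would multiply the joined (block, stripe)
            // pair and emit the partial result stripe keyed by the row-block index i;
            // the combiner/reducer would sum all partial stripes for the same i.
            // conf.setMapperClass(BlockMultiplyMapper.class);
            // conf.setCombinerClass(StripeSumReducer.class);
            // conf.setReducerClass(StripeSumReducer.class);

            FileOutputFormat.setOutputPath(conf, output);
            JobClient.runJob(conf);
        }
    }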

I know that I could do this in two passes with a reduce-side join, but efficiency is critical here because these matrix-vector multiplications have to be executed a large number of times. The DistributedCache is out of the question because the vector doesn't fit in memory.
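For reference, pass 1 of the reduce-side variant I'd like to avoid would look roughly like this (element-wise rather than block-wise, old "mapred" API; the "V:<value>" / "M:<row>:<value>" value encoding is just a placeholder of my own). Pass 2 would then simply sum the emitted products per row index:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Pass 1: matrix entries m_ij and vector entries v_j are both keyed by column j, so one
    // reduce call sees v_j plus every m_ij of that column and emits (i, m_ij * v_j).
    public class JoinByColumnReducer extends MapReduceBase
            implements Reducer<IntWritable, Text, IntWritable, DoubleWritable> {

        public void reduce(IntWritable col, Iterator<Text> values,
                OutputCollector<IntWritable, DoubleWritable> out, Reporter reporter)
                throws IOException {
            double vj = 0.0;
            // Buffer the matrix entries of this column (fine in memory, the matrix is sparse)
            List<Integer> rows = new ArrayList<Integer>();
            List<Double> entries = new ArrayList<Double>();
            while (values.hasNext()) {
                String[] parts = values.next().toString().split(":");
                if (parts[0].equals("V")) {
                    vj = Double.parseDouble(parts[1]);           // the vector entry v_j
                } else {
                    rows.add(Integer.parseInt(parts[1]));        // matrix row index i
                    entries.add(Double.parseDouble(parts[2]));   // matrix value m_ij
                }
            }
            for (int k = 0; k < rows.size(); k++) {
                out.collect(new IntWritable(rows.get(k)),
                        new DoubleWritable(entries.get(k) * vj));
            }
        }
    }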

Does anyone have an idea how to tackle this problem?