On another note, Sean is absolutely correct: Amazon Elastic MapReduce does
indeed seem to be stuck on 0.20 (or, rather, stuck with a particular Hadoop
setup without much flexibility). I guess moving ahead with the new APIs in
Mahout would indeed create problems for whoever is using EMR (I don't).
On Sun, Jan 2, 2011 at 9:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> I would think blockwise multiplication (which, by the way, has a standard
> algorithm in Matrix Computations by Golub and Van Loan) is pretty pointless
> with Mahout, since there's no blockwise matrix format at present, and even
> if there were, no existing algorithm would support it. All the prep
> utilities produce only row-wise format. We could write a routine to "block"
> the input, but that would seem to be an exercise in futility.
>
> A second remark: blockwise multiplication is also pointless for
> sufficiently sparse matrices. Indeed, summing the outer products of columns
> and rows, with intermediate reduction in combiners, is by far the most
> promising approach in terms of shuffle-and-sort I/O. The outer products,
> when further split into columns or rows, are also quite sparse and hence
> small, while the reduction in keyset cardinality is gigantic compared to
> blockwise multiplication. (That said, I never ran a comparative benchmark
> of the two.)
>
> Note that what the authors are essentially suggesting (even in strategy 4)
> implies explosive growth of shuffle-and-sort keyset I/O, and what's more,
> they say they never tried it in distributed mode(!). Imagine hundreds of
> machines each sending a copy of their input to many other machines in the
> cluster. Summing outer products avoids broadcasting the input to multiple
> reducers.
>
> On another note, if the inputs are identically partitioned (not always the
> case), then map-side multiplication will always be superior to reduce-side
> multiplication in I/O terms, since less data, and in particular a much
> smaller keyset cardinality, passes through the sorters. The power of
> map-side operations comes from the notion that yes, we require a lot from
> the input, but no, it's not a lot if the input is already part of a bigger
> MR pipeline.
>
> Finally, back to the 0.20/0.21 issue... I said before in this thread that
> migrating to 0.21 would render Mahout incompatible with the majority of
> production frameworks out there. But after working on the ssvd code, I came
> to think of a compromise: since most production environments run the
> Cloudera distribution, many 0.21 features are supported there, and there's
> a lot of code around written for the new API that Cloudera has backported.
> It's difficult for me to judge how much of 0.21 Cloudera's implementation
> covers (in fact, I did come across a couple of 0.21 things still missing in
> CDH), but in terms of Hadoop compatibility, I think the Mahout project
> would be best served if it indeed moved to the new API (i.e. 0.21) but did
> not get ahead of what is supported in CDH3. That would keep it on the edge
> of what's currently practical and deployed. Continuing to sit on the old
> API is, IMO, definitely a drag. My stochastic SVD code uses the new API in
> CDH3, and I would very much not want to backport it to the old API; that
> would not be practical, as everyone out there is on CDH, more so than on
> 0.20.2.
>
> Dmitriy
>
>
>
>>> Some more general remarks: I think the matrix multiplication can be
>>> implemented more efficiently. I've done a matrix multiplication of a
>>> sparse 500k x 15k matrix with around 35 million elements on a quite
>>> powerful cluster of 10 nodes, and it took around 30 minutes. I have no
>>> idea of the performance of the implementation described at
>>> http://homepage.mac.com/j.norstad/matrixmultiply/index.html, so I can't
>>> really compare. But IMHO this can be improved (though it's possible the
>>> poor performance was due to mistakes on my part).
>>>
>> I will definitely investigate these methods over the coming days; they
>> look fantastic.
>>
>> Shannon
>>
>
>
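For readers following along, the sum-of-outer-products multiplication Dmitriy
describes can be sketched outside Hadoop. This is a minimal Python sketch, not
Mahout code; the sparse row-dict representation and the function name are
illustrative assumptions, and the "combiner" here is just a local accumulator
standing in for the MapReduce combine/reduce stages:

```python
from collections import defaultdict

def outer_product_matmul(a_rows, b_rows):
    """Compute C = A * B as a sum of outer products:
    C = sum over k of (column k of A) outer (row k of B).

    a_rows, b_rows: sparse matrices as {row_index: {col_index: value}}.
    In the MapReduce formulation, each mapper emits one sparse partial
    outer product per index k, and combiners/reducers sum the partials
    keyed by output row -- so keyset cardinality stays at the number of
    output rows instead of growing with block pairs.
    """
    # Recover column k of A from row-wise storage: a_cols[k][i] = A[i][k]
    a_cols = defaultdict(dict)
    for i, row in a_rows.items():
        for k, v in row.items():
            a_cols[k][i] = v

    # Local partial-sum accumulator (the "combiner" role)
    c = defaultdict(lambda: defaultdict(float))
    for k, col in a_cols.items():          # one outer product per index k
        b_row = b_rows.get(k, {})
        for i, a_ik in col.items():
            for j, b_kj in b_row.items():
                c[i][j] += a_ik * b_kj
    return {i: dict(r) for i, r in c.items()}

# Tiny dense-as-sparse check: A and B are 2x2
A = {0: {0: 1.0, 1: 2.0}, 1: {0: 3.0, 1: 4.0}}
B = {0: {0: 5.0, 1: 6.0}, 1: {0: 7.0, 1: 8.0}}
C = outer_product_matmul(A, B)
# C == {0: {0: 19.0, 1: 22.0}, 1: {0: 43.0, 1: 50.0}}
```

Note how sparsity helps: an outer product of a sparse column and a sparse row
touches only the nonzero index pairs, which is what keeps the emitted partials
small relative to a blockwise scheme.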
