mahout-dev mailing list archives

From "Joris Geessels (JIRA)" <>
Subject [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
Date Fri, 31 Dec 2010 09:27:46 GMT


Joris Geessels commented on MAHOUT-537:

The matrix-matrix multiplication seems like an ugly hack to me; I'm actually in favor of keeping
the old API until we can switch to 0.21.0.
Some remarks: 
1) I didn't test the code either, but I couldn't spot any obvious errors, so it seems to me
that it should work.
2) This implementation uses 3 M/R jobs where the original uses only 1. I agree that the
first two jobs are very basic operations, but for performance's sake it's still better to
keep the number of jobs low. I'm almost 100% certain that this implementation will be slower
than the original one (though I have no idea how much slower; it would be interesting to know).
3) Every row of the DRM now has an extra String variable to store and send. Especially when
the matrix is very sparse, this will result in a substantial overhead.
4) The MatrixMultiplicationReducer receives a NamedVectorWritable, but there's no reason
for this. It would be better to use a plain VectorWritable.
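
To put remark 3 in rough numbers, here is a minimal sketch. It assumes a hypothetical wire
layout (an int count followed by int index / double value pairs, with the tag written via
DataOutputStream.writeUTF); this is not Mahout's actual VectorWritable or NamedVectorWritable
encoding, and RowTagOverhead, serializedSize and the tag "matrixA" are made-up names used
only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RowTagOverhead {
    // Serializes a sparse row as a count plus (int index, double value) pairs,
    // optionally preceded by a String tag, and returns the number of bytes written.
    // Assumption: this layout is illustrative only, not Mahout's real format.
    static int serializedSize(String tag, int[] indices, double[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            if (tag != null) {
                out.writeUTF(tag); // 2-byte length prefix + modified UTF-8 bytes
            }
            out.writeInt(indices.length); // number of non-zero entries
            for (int i = 0; i < indices.length; i++) {
                out.writeInt(indices[i]);    // 4 bytes
                out.writeDouble(values[i]);  // 8 bytes
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int[] idx = {3, 17, 42};
        double[] val = {1.0, 2.5, -0.5};
        int plain = serializedSize(null, idx, val);
        int tagged = serializedSize("matrixA", idx, val);
        System.out.println(plain + " bytes untagged, " + tagged + " bytes tagged");
    }
}
```

Under these assumptions a row with 3 non-zero entries takes 40 bytes untagged and 49 bytes
tagged. The tag cost is fixed per row, so its relative weight grows as the rows get sparser.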

If we insist on compliance with 0.20.2, it might be interesting to have a look at: 
This implementation avoids the use of CompositeInputFormat by checking the current input path
in the setup.
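
The idea of telling the two operands apart by input path, instead of relying on
CompositeInputFormat, can be sketched without Hadoop as follows. This is a plain-Java
simulation, not the code from the patch or from the implementation referenced above:
PathTaggedJoin, Row, isFromA and transposeTimes are hypothetical names, and it assumes the
product is built by joining rows with the same index and summing their outer products
(which yields A'B). In a real mapper on the 0.20.2 API the path would presumably be read
from the input split in setup(), e.g. by casting context.getInputSplit() to FileSplit and
calling getPath().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PathTaggedJoin {
    // One input record: the file it was read from, its row index, and the row values.
    static final class Row {
        final String path;
        final int index;
        final double[] values;

        Row(String path, int index, double[] values) {
            this.path = path;
            this.index = index;
            this.values = values;
        }
    }

    // "Map" side: decide which operand a row belongs to by comparing the path of
    // the file it came from with the directory configured for matrix A.
    static boolean isFromA(Row row, String dirOfA) {
        return row.path.startsWith(dirOfA + "/");
    }

    // "Reduce" side: group rows by index, then sum the outer products of the
    // A-row and the B-row that share an index. The result is A'B.
    static double[][] transposeTimes(List<Row> rows, String dirOfA, int colsA, int colsB) {
        Map<Integer, double[][]> byIndex = new TreeMap<>(); // index -> {rowOfA, rowOfB}
        for (Row r : rows) {
            double[][] pair = byIndex.computeIfAbsent(r.index, k -> new double[2][]);
            pair[isFromA(r, dirOfA) ? 0 : 1] = r.values;
        }
        double[][] product = new double[colsA][colsB];
        for (double[][] pair : byIndex.values()) {
            if (pair[0] == null || pair[1] == null) {
                continue; // a row missing from one matrix contributes nothing
            }
            for (int i = 0; i < colsA; i++) {
                for (int j = 0; j < colsB; j++) {
                    product[i][j] += pair[0][i] * pair[1][j];
                }
            }
        }
        return product;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("/data/A/part-00000", 0, new double[] {1, 2}));
        rows.add(new Row("/data/A/part-00000", 1, new double[] {3, 4}));
        rows.add(new Row("/data/B/part-00000", 0, new double[] {5, 6}));
        rows.add(new Row("/data/B/part-00000", 1, new double[] {7, 8}));
        System.out.println(Arrays.deepToString(transposeTimes(rows, "/data/A", 2, 2)));
    }
}
```

Because the operand is derived from the path, the rows themselves do not need to carry a
name, which also sidesteps the per-row String overhead mentioned in remark 3.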

Some more general remarks: I think the matrix multiplication can be implemented more efficiently.
I've done a matrix multiplication of a sparse 500k x 15k matrix with around 35 million elements
on a quite powerful cluster of 10 nodes, and this took around 30 minutes. I have no idea of
the performance of the implementation described at,
so I can't really compare. But IMHO this can be improved (though it's possible that the poor
performance was due to mistakes on my part).

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>                 Key: MAHOUT-537
>                 URL:
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular
> eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
