mahout-dev mailing list archives

From "Joris Geessels (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
Date Fri, 31 Dec 2010 09:27:46 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976204#action_12976204 ]

Joris Geessels commented on MAHOUT-537:
---------------------------------------

The matrix-matrix multiplication seems like an ugly hack to me; I'm actually in favor of
keeping the old API until we can switch to 0.21.0.
Some remarks: 
1) I didn't test the code either, but couldn't spot any obvious errors. So it seems to me
that it should work.
2) This implementation uses 3 M/R jobs where the original one has only 1. I agree that the
first two jobs are very basic operations, but for performance's sake it's still better to
keep the number of jobs low. I'm almost 100% certain that this implementation will be slower
than the original one (though I have no idea how much slower; that would be interesting to
know).
3) Every row of the DRM now has an extra String variable to store and send. Especially when
the matrix is very sparse, this will result in substantial overhead.
4) The MatrixMultiplicationReducer receives a NamedVectorWritable, but there's no reason
for this. It would be better to use a plain VectorWritable.
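
To put remark 3 in perspective, here is a rough back-of-the-envelope sketch of how the
cost of a per-row name scales with sparsity. The byte sizes are assumptions chosen for
illustration, not measurements of Mahout's actual serialization, and the class and method
names are hypothetical.

```java
final class RowNameOverhead {
    // Extra bytes for the row name, relative to the bytes used by the
    // row's non-zero entries.
    static double relativeOverhead(int nonZerosPerRow, int bytesPerEntry, int nameBytes) {
        if (nonZerosPerRow <= 0 || bytesPerEntry <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (double) nameBytes / ((double) nonZerosPerRow * bytesPerEntry);
    }

    public static void main(String[] args) {
        int bytesPerEntry = 12; // assumption: 4-byte int index + 8-byte double value
        int nameBytes = 24;     // assumption: a short row name plus its length prefix
        for (int nnz : new int[] {5, 70, 1000}) {
            System.out.printf("%d non-zeros per row: +%.1f%%%n",
                    nnz, 100 * relativeOverhead(nnz, bytesPerEntry, nameBytes));
        }
    }
}
```

Under these assumed sizes the name is a fixed cost per row, so its relative weight grows as
rows get sparser.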

If we insist on compliance with 0.20.2, it might be interesting to have a look at:
http://homepage.mac.com/j.norstad/matrix-multiply/index.html
This implementation avoids the use of CompositeInputFormat by checking the current input
path in the setup.
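
The idea of tagging records by their input path can be sketched in plain Java, without any
Hadoop dependency. This is a simulation under stated assumptions, not the linked
implementation and not Mahout code: the class, the helper names, and the example paths are
hypothetical. Both matrices are assumed to be stored by rows, so joining on the row index
yields A^T * B as a sum of outer products.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InputPathTagging {
    // Decide which matrix a record belongs to from the path of the file it came from.
    // In a Hadoop 0.20.2 mapper this path would typically be read once in setup(),
    // e.g. ((FileSplit) context.getInputSplit()).getPath() (assumes file-based input).
    static char tagFor(String inputPath, String pathA, String pathB) {
        if (inputPath.startsWith(pathA + "/")) return 'A';
        if (inputPath.startsWith(pathB + "/")) return 'B';
        throw new IllegalArgumentException("Unexpected input path: " + inputPath);
    }

    // A matrix row tagged with its source matrix.
    static final class TaggedRow {
        final char tag;
        final int rowIndex;
        final double[] values;
        TaggedRow(char tag, int rowIndex, double[] values) {
            this.tag = tag;
            this.rowIndex = rowIndex;
            this.values = values;
        }
    }

    // "Map" step: tag the row by its source; the row index acts as the join key.
    static TaggedRow map(String inputPath, int rowIndex, double[] values,
                         String pathA, String pathB) {
        return new TaggedRow(tagFor(inputPath, pathA, pathB), rowIndex, values);
    }

    // "Reduce" step: pair row k of A with row k of B and accumulate their
    // outer product, which sums up to A^T * B.
    static double[][] reduce(List<TaggedRow> rows, int colsA, int colsB) {
        Map<Integer, double[]> rowsOfA = new HashMap<>();
        Map<Integer, double[]> rowsOfB = new HashMap<>();
        for (TaggedRow r : rows) {
            (r.tag == 'A' ? rowsOfA : rowsOfB).put(r.rowIndex, r.values);
        }
        double[][] result = new double[colsA][colsB];
        for (Map.Entry<Integer, double[]> e : rowsOfA.entrySet()) {
            double[] a = e.getValue();
            double[] b = rowsOfB.get(e.getKey());
            if (b == null) continue; // no matching row in B: contributes nothing
            for (int i = 0; i < colsA; i++) {
                for (int j = 0; j < colsB; j++) {
                    result[i][j] += a[i] * b[j];
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String pathA = "/data/A", pathB = "/data/B"; // hypothetical input directories
        List<TaggedRow> rows = new ArrayList<>();
        rows.add(map("/data/A/part-00000", 0, new double[] {1, 2}, pathA, pathB));
        rows.add(map("/data/B/part-00000", 0, new double[] {5, 6}, pathA, pathB));
        rows.add(map("/data/A/part-00001", 1, new double[] {3, 4}, pathA, pathB));
        rows.add(map("/data/B/part-00001", 1, new double[] {7, 8}, pathA, pathB));
        System.out.println(Arrays.deepToString(reduce(rows, 2, 2)));
    }
}
```

Because the source is derived from the path, the rows themselves need no name attached,
and both inputs can be fed to a single job through a plain file input format.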

Some more general remarks: I think the matrix multiplication can be implemented more efficiently.
I've done a matrix multiplication of a sparse 500k x 15k matrix with around 35 million elements
on a quite powerful cluster of 10 nodes, and this took around 30 minutes. I have no idea of
the performance of the implementation described at http://homepage.mac.com/j.norstad/matrix-multiply/index.html,
so I can't really compare. But IMHO this can be improved (though it's possible that the poor
performance was due to mistakes on my part).

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular
> eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration
> objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

