mahout-dev mailing list archives

From "Dmitriy Lyubimov (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (MAHOUT-376) Implement Map-reduce version of stochastic SVD
Date Wed, 01 Dec 2010 07:49:11 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965597#action_12965597 ]

Dmitriy Lyubimov edited comment on MAHOUT-376 at 12/1/10 2:48 AM:
------------------------------------------------------------------

Final trunk patch for the CDH3 or Hadoop 0.21 API.
It includes code cleanup, javadoc updates, and a Mahout CLI class (not yet tested).

All existing tests and the test added by this patch pass. I tested a 100Kx100 matrix in local
mode only; the S values coincide to within 1e-10 or better.
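
For anyone repeating the check, here is a minimal sketch of what "coincide to within 1e-10" could mean in code. It is not taken from the patch: the class name, the method names, and the choice of an absolute (rather than relative) tolerance are all assumptions made for illustration.

```java
public class SingularValueCheck {

    /** Largest absolute difference between corresponding entries of two arrays. */
    public static double maxAbsDiff(double[] expected, double[] actual) {
        if (expected.length != actual.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double max = 0.0;
        for (int i = 0; i < expected.length; i++) {
            max = Math.max(max, Math.abs(expected[i] - actual[i]));
        }
        return max;
    }

    /** True if every pair of values differs by at most tol (absolute tolerance, an assumption). */
    public static boolean coincide(double[] expected, double[] actual, double tol) {
        return maxAbsDiff(expected, actual) <= tol;
    }

    public static void main(String[] args) {
        // Hypothetical values; in a real test these would be the singular values
        // from a reference solver and from the map-reduce job.
        double[] reference = {10.0, 5.0, 1.0};
        double[] computed = {10.0 + 1e-12, 5.0 - 2e-12, 1.0};
        System.out.println(coincide(reference, computed, 1e-10));
    }
}
```

A relative tolerance may be the better choice when the singular values span many orders of magnitude.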

Dependency changes I had to make:
* Hadoop 0.21 or CDH3, to support multiple outputs.
* Local MR mode depends on commons-httpclient, so I included it in test scope only, so that
the test works.
* Changed the commons-math dependency from 1.2 (?) to 2.1. The Mahout math module seems to
depend on 2.1 too; it is not clear why that was not picked up transitively here.
* commons-math 1.2 seems to have depended on commons-cli, and 2.1 no longer brings it in
transitively, but one of the classes in core requires it, so I added commons-cli to fix
the build.

*Ted*, sorry I kind of polluted your issue here. Thank you for your encouragement and help.
I probably should have opened another issue once it was clear this had diverged far enough,
instead of continuing to put stuff here.

This should be compatible with DistributedRowMatrix. I have not run a real distributed test
yet because I don't have a suitable data set, but perhaps somebody in the user community with
an interest in the method could do it sooner than I get to it. I will run moderate-scale tests
at some point, but I don't want to do that on my company's cluster yet and I don't exactly
own a good one myself.

I ended up with a rather mixed use of Mahout vector math and plain dense arrays, partly because
I did not have enough time to study all the capabilities of the math module, and partly because
I wanted explicit access to memory so I could control its re-use more efficiently across many
iterations. This may or may not need to be rectified over time, but it seems to work pretty
well as is.
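
To illustrate the kind of memory re-use meant here, below is a rough sketch (not code from the patch) of projecting many rows through a small dense matrix while writing into one preallocated scratch array. The names `BufferReuse`, `project`, `omega`, and `scratch` are hypothetical, and the toy matrix merely stands in for whatever dense data the real job holds.

```java
import java.util.Arrays;

public class BufferReuse {

    /**
     * Computes out[j] = sum over i of row[i] * omega[i][j], overwriting the
     * caller-supplied array instead of allocating a new one per call.
     */
    public static void project(double[] row, double[][] omega, double[] out) {
        Arrays.fill(out, 0.0);
        for (int i = 0; i < row.length; i++) {
            double x = row[i];
            if (x == 0.0) {
                continue; // skip zero entries, they contribute nothing
            }
            double[] omegaRow = omega[i];
            for (int j = 0; j < out.length; j++) {
                out[j] += x * omegaRow[j];
            }
        }
    }

    public static void main(String[] args) {
        double[][] omega = {{1, 0}, {0, 1}, {1, 1}};  // toy 3x2 matrix
        double[][] rows = {{1, 2, 3}, {4, 5, 6}};     // toy input rows
        double[] scratch = new double[2];             // allocated once, reused for every row
        for (double[] row : rows) {
            project(row, omega, scratch);
            System.out.println(Arrays.toString(scratch));
        }
    }
}
```

The trade-off is the usual one: the caller has to consume or copy `scratch` before the next call overwrites it, which is harder to guarantee behind a general vector abstraction.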

The patch is a git patch (so one needs to use patch -p1 instead of -p0). I know the standard
is to use svn patches, but I had already used git to pull the trunk (as it happens, I prefer
git in general too, so I can have my own commit tree and branching for this work).

If there is enough interest from the project in this contribution, I will support it, and if
requested I can port it to 0.20 if that is the target platform for 0.5, as well as make other
Mahout-specific architectural tweaks. Please kindly let me know.


Thank you.

> Implement Map-reduce version of stochastic SVD
> ----------------------------------------------
>
>                 Key: MAHOUT-376
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-376
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.5
>
>         Attachments: MAHOUT-376.patch, Modified stochastic svd algorithm for mapreduce.pdf,
>                      QR decomposition for Map.pdf, QR decomposition for Map.pdf, QR decomposition for Map.pdf,
>                      sd-bib.bib, sd.pdf, sd.pdf, sd.pdf, sd.pdf, sd.tex, sd.tex, sd.tex, sd.tex,
>                      SSVD working notes.pdf, SSVD working notes.pdf, SSVD working notes.pdf,
>                      ssvd-CDH3-or-0.21.patch.gz, ssvd-m1.patch.gz, ssvd-m2.patch.gz, ssvd-m3.patch.gz,
>                      Stochastic SVD using eigensolver trick.pdf
>
>
> See attached pdf for outline of proposed method.
> All comments are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

