mahout-dev mailing list archives

From "Danny Leshem (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer
Date Tue, 23 Feb 2010 17:00:27 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837310#action_12837310 ]

Danny Leshem edited comment on MAHOUT-180 at 2/23/10 4:59 PM:
--------------------------------------------------------------

While testing the new code, I encountered the following issue:

...
10/02/23 18:11:17 INFO lanczos.LanczosSolver: LanczosSolver finished.
10/02/23 18:11:17 ERROR decomposer.EigenVerificationJob: Unexpected --input while processing Options
Usage:
 [--eigenInput <eigenInput> --corpusInput <corpusInput> --help --output <output>
  --inMemory <inMemory> --maxError <maxError> --minEigenvalue <minEigenvalue>]
Options
...

The problem seems to be in DistributedLanczosSolver.java [73]:
EigenVerificationJob expects the parameter names to be "eigenInput" and "corpusInput", but
you're mistakenly passing them as "input" and "output".

Other than this minor issue, the code seems to be working fine and indeed produces the right
number of dense (eigen?) vectors.
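
To make the mismatch concrete, here is a minimal Java sketch. It is not the Mahout code: the class VerifierArgs, its methods, and the path strings are hypothetical, and which value the caller intended for which option is an assumption. Only the option names are taken from the usage message in the log above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not part of Mahout.
class VerifierArgs {
    // Option names copied from the usage message printed by EigenVerificationJob.
    static final List<String> EXPECTED = Arrays.asList(
        "--eigenInput", "--corpusInput", "--help", "--output",
        "--inMemory", "--maxError", "--minEigenvalue");

    // Builds an argument array that uses the names the job expects.
    static String[] build(String eigenPath, String corpusPath, String outputPath) {
        return new String[] {
            "--eigenInput", eigenPath,
            "--corpusInput", corpusPath,
            "--output", outputPath
        };
    }

    // Returns every "--" flag in args that is not in the expected list.
    static List<String> unexpectedOptions(String[] args) {
        List<String> bad = new ArrayList<String>();
        for (String arg : args) {
            if (arg.startsWith("--") && !EXPECTED.contains(arg)) {
                bad.add(arg);
            }
        }
        return bad;
    }

    public static void main(String[] argv) {
        // Assumed shape of the failing call: "--input" is not a known option.
        String[] wrong = {"--input", "corpus", "--output", "out"};
        String[] right = build("eigenvectors", "corpus", "out");
        System.out.println(unexpectedOptions(wrong)); // prints [--input]
        System.out.println(unexpectedOptions(right)); // prints []
    }
}
```

A check of this kind reports "--input" as the only unknown flag, which is consistent with the "Unexpected --input" error in the log, since "--output" is itself a valid option.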

      was (Author: dleshem):
    While testing the new code, I encountered the following issue:

...
10/02/23 18:11:17 INFO lanczos.LanczosSolver: LanczosSolver finished.
10/02/23 18:11:17 ERROR decomposer.EigenVerificationJob: Unexpected --input while processing Options
Usage:
 [--eigenInput <eigenInput> --corpusInput <corpusInput> --help --output <output>
  --inMemory <inMemory> --maxError <maxError> --minEigenvalue <minEigenvalue>]
Options
...

The problem seems to be in DistributedLanczosSolver.java [73]:
EigenVerificationJob expects the parameter names to be "--eigenInput" and "--corpusInput",
but you're mistakenly passing them as "--input" and "--output".

Other than this minor issue, the code seems to be working fine and indeed produces the right
number of dense (eigen?) vectors.
  
> port Hadoop-ified Lanczos SVD implementation from decomposer
> ------------------------------------------------------------
>
>                 Key: MAHOUT-180
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-180
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>    Affects Versions: 0.2
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch
>
>
> I wrote up a Hadoop version of the Lanczos algorithm for performing SVD on sparse matrices, available at http://decomposer.googlecode.com/. It is Apache-licensed, and I'm willing to donate it.  I'll have to port the implementation over to use Mahout vectors, or else add in these vectors as well.
> Current issues with the decomposer implementation include the following: if your matrix is really big, you need to re-normalize before decomposition. Find the largest eigenvalue first, divide all your rows by that value, and then decompose; otherwise you'll blow past Double.MAX_VALUE once you've run too many iterations. The L^2 norm of intermediate vectors grows roughly as (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on the lower end is better than blowing past MAX_VALUE.  When this is ported to Mahout, we should add the capability to do this automatically: run a couple of iterations to find the largest eigenvalue, save that, then iterate while scaling vectors by 1/max_eigenvalue.
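
The re-normalization described in the issue can be sketched in Java as follows. This is only an illustration under stated assumptions: it uses a small dense symmetric matrix and plain power iteration to estimate the largest eigenvalue, whereas the issue proposes running a few Lanczos iterations on a sparse, distributed matrix. The class EigenScaling and all of its method names are hypothetical and do not come from Mahout or decomposer.

```java
import java.util.Arrays;

// Hypothetical sketch; dense arrays stand in for the real sparse matrix.
class EigenScaling {
    // Dense matrix-vector product.
    static double[] times(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < v.length; j++) {
                sum += m[i][j] * v[j];
            }
            out[i] = sum;
        }
        return out;
    }

    // Euclidean (L^2) norm of a vector.
    static double norm(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    // Power iteration: estimates the largest-magnitude eigenvalue of a
    // symmetric matrix. Assumes the uniform start vector is not orthogonal
    // to the dominant eigenvector.
    static double largestEigenvalue(double[][] m, int iterations) {
        double[] v = new double[m.length];
        Arrays.fill(v, 1.0 / Math.sqrt(m.length)); // unit-length start vector
        double lambda = 0.0;
        for (int k = 0; k < iterations; k++) {
            double[] w = times(m, v);
            lambda = norm(w); // since |v| = 1, |Mv| approaches |lambda_max|
            if (lambda == 0.0) {
                return 0.0;
            }
            for (int i = 0; i < w.length; i++) {
                v[i] = w[i] / lambda; // re-normalize to avoid overflow
            }
        }
        return lambda;
    }

    // Returns a copy of m with every row divided by lambda.
    static double[][] scaleRows(double[][] m, double lambda) {
        double[][] out = new double[m.length][];
        for (int i = 0; i < m.length; i++) {
            out[i] = new double[m[i].length];
            for (int j = 0; j < m[i].length; j++) {
                out[i][j] = m[i][j] / lambda;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] m = {{4.0, 1.0}, {1.0, 3.0}};
        double lambda = largestEigenvalue(m, 100);
        double[][] scaled = scaleRows(m, lambda);
        System.out.println("estimated largest eigenvalue: " + lambda);
        System.out.println("after scaling: " + largestEigenvalue(scaled, 100));
    }
}
```

After scaling, the largest eigenvalue is close to 1, so the growth factor (largest-eigenvalue)^(num-eigenvalues-found-so-far) no longer increases with each iteration.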

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

