mahout-dev mailing list archives

From "Danny Leshem (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAHOUT-369) Issues with DistributedLanczosSolver output
Date Wed, 07 Apr 2010 13:31:33 GMT
Issues with DistributedLanczosSolver output
-------------------------------------------

                 Key: MAHOUT-369
                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
             Project: Mahout
          Issue Type: Bug
          Components: Math
    Affects Versions: 0.3, 0.4
            Reporter: Danny Leshem


DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
{code}
    log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
{code}

However, a few lines later (line 106) we have
{code}
    for(int i=0; i<eigenVectors.numRows() - 1; i++) {
        ...
    }
{code}

which only persists eigenVectors.numRows()-1 vectors.

It seems that the most significant eigenvector (i.e. the one with the largest eigenvalue) is
omitted. Is this an off-by-one bug?
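
To make the discrepancy concrete, here is a minimal standalone sketch. It does not use any Mahout classes; the class and method names are hypothetical, and only the two loop bounds are taken from the snippets above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of DistributedLanczosSolver.
class OffByOneSketch {

    // Row indices visited by the loop bound quoted above: i < numRows - 1.
    static List<Integer> indicesWithReportedBound(int numRows) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numRows - 1; i++) {
            indices.add(i);
        }
        return indices;
    }

    // Row indices visited by a bound that matches the log message: i < numRows.
    static List<Integer> indicesWithFullBound(int numRows) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numRows; i++) {
            indices.add(i);
        }
        return indices;
    }

    public static void main(String[] args) {
        // With 4 rows, the reported bound never reaches the last row (index 3).
        System.out.println(indicesWithReportedBound(4)); // [0, 1, 2]
        System.out.println(indicesWithFullBound(4));     // [0, 1, 2, 3]
    }
}
```

If the last row really holds the eigenvector with the largest eigenvalue, as it appears to, the first variant is the one that drops it.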


Also, I think it would be better if the eigenvectors were persisted in *reverse* order, meaning
the most significant vector is marked "0", the 2nd most significant is marked "1", etc.

There are two reasons for this:
1) When performing another PCA on the same corpus (say, with more principal components),
corresponding eigenvalues can be easily matched and compared.
2) It makes it easier to discard the least significant principal components, which for Lanczos
decomposition are usually garbage.
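
A rough sketch of the proposed ordering, using plain arrays rather than Mahout types. The class and method names are hypothetical, and the sketch assumes the rows currently arrive with the most significant eigenvector last, as described above.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of DistributedLanczosSolver.
class ReverseOrderSketch {

    // Assumes rows are ordered least significant first, most significant last.
    // Returns the same rows re-keyed so that key 0 is the most significant.
    static double[][] mostSignificantFirst(double[][] rows) {
        int n = rows.length;
        double[][] out = new double[n][];
        for (int key = 0; key < n; key++) {
            out[key] = rows[n - 1 - key];
        }
        return out;
    }

    // With key 0 as the most significant, discarding the tail is a prefix copy.
    static double[][] keepTop(double[][] ordered, int k) {
        return Arrays.copyOfRange(ordered, 0, Math.min(k, ordered.length));
    }

    public static void main(String[] args) {
        double[][] rows = { {1.0}, {2.0}, {3.0} }; // stand-ins for eigenvectors
        double[][] ordered = mostSignificantFirst(rows);
        System.out.println(ordered[0][0]);             // 3.0
        System.out.println(keepTop(ordered, 2).length); // 2
    }
}
```

With this keying, vector "0" from a run with few components and vector "0" from a run with more components refer to the same rank, which is what reason 1 relies on.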

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

