mahout-dev mailing list archives

From "Jake Mannix (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-369) Issues with DistributedLanczosSolver output
Date Mon, 04 Apr 2011 15:43:06 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015467#comment-13015467 ]

Jake Mannix commented on MAHOUT-369:
------------------------------------

I'll fix those imports, and add some comments on what has changed / what is being done now.

Some further improvements are still necessary, but this is strictly better than
what was there before, so I'll commit this as soon as I can (after adding said comments, to
be fleshed out on the list/wiki further).

> Issues with DistributedLanczosSolver output
> -------------------------------------------
>
>                 Key: MAHOUT-369
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-369
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3, 0.4
>            Reporter: Danny Leshem
>            Assignee: Jake Mannix
>              Labels: DistributedLanczosSolver, decomposer
>             Fix For: 0.5
>
>         Attachments: MAHOUT-369.diff, MAHOUT-369.patch
>
>
> DistributedLanczosSolver (line 99) claims to persist eigenVectors.numRows() vectors.
> {code}
>     log.info("Persisting " + eigenVectors.numRows() + " eigenVectors and eigenValues to: " + outputPath);
> {code}
> However, a few lines later (line 106) we have
> {code}
>     for(int i=0; i<eigenVectors.numRows() - 1; i++) {
>         ...
>     }
> {code}
> which only persists eigenVectors.numRows()-1 vectors.
> Seems like the most significant eigenvector (i.e. the one with the largest eigenvalue) is omitted... off-by-one bug?
> Also, I think it would be better if the eigenvectors were persisted in *reverse* order, meaning the most significant vector is marked "0", the 2nd most significant is marked "1", etc.
> This, for two reasons:
> 1) When performing another PCA on the same corpus (say, with more principal components), corresponding eigenvalues can be easily matched and compared.
> 2) Makes it easier to discard the least significant principal components, which for Lanczos decomposition are usually garbage.
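
As a rough illustration of the two points in the report above (the loop bound and the proposed reverse ordering), here is a minimal Java sketch. It does not use Mahout's API: EigenOrderSketch and reversedIndexPairs are hypothetical names, and it assumes, as the report implies, that the most significant eigenvector sits in the last row.

```java
import java.util.ArrayList;
import java.util.List;

public class EigenOrderSketch {

    // Hypothetical helper, not part of Mahout.
    // Returns {outputIndex, sourceRow} pairs that cover every row and
    // assign output index 0 to the last source row (assumed here to be
    // the most significant eigenvector).
    static List<int[]> reversedIndexPairs(int numRows) {
        List<int[]> pairs = new ArrayList<>();
        // Bound is i < numRows, not i < numRows - 1, so no row is skipped.
        for (int i = 0; i < numRows; i++) {
            int sourceRow = numRows - 1 - i;
            pairs.add(new int[] {i, sourceRow});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // With 4 rows, all 4 are visited and row 3 is written as index 0.
        for (int[] pair : reversedIndexPairs(4)) {
            System.out.println("output " + pair[0] + " <- row " + pair[1]);
        }
    }
}
```

With this indexing, dropping the least significant components amounts to keeping only the lowest output indices.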

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
