[ https://issues.apache.org/jira/browse/MAHOUT-512?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13157657#comment13157657 ]
Raphael Cendrillon commented on MAHOUT-512:

A few comments might be in order:
- SVD is currently done sequentially. This means the code scales with the size of the training data, but not with the feature size. I'd like to hook this into SSVD to make the feature size scalable as well.
- There is a fair amount of code duplication between PcaMatrix and DistributedRowMatrix. This was done to minimize the impact on existing code. If MatrixCovarianceJob, MatrixRowAverageJob, MatrixVectorAdditionJob etc. are found to be useful in other places, then it might be worth integrating these jobs directly into DistributedRowMatrix.
- Projection into feature space and reconstruction using principal components are currently done with TransposeJob and MatrixMultiplicationJob. It may be worth implementing these as separate M/R jobs for higher efficiency.
- There is some code duplication between some of the jobs. It might be worthwhile to push the common code into an abstract class.
- Currently MatrixVectorAdditionJob uses an outer join to subtract the vector from each row of the matrix. There are probably more efficient ways to do this.
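One such alternative can be sketched in plain Python (the row values and the in-memory mean vector below are hypothetical, purely for illustration): since the mean vector is only d-dimensional, each mapper could load it once (e.g. from the distributed cache) and do the subtraction map-side, avoiding the shuffle that a reduce-side join requires.

```python
# Hypothetical sketch: map-side vector subtraction instead of an outer join.
# Assumption: the mean vector is small enough to hold in each mapper's memory.
mean = [2.0, 2.0, 1.0]  # loaded once per mapper, e.g. from the distributed cache

def map_subtract(row):
    """Map phase: subtract the broadcast mean vector from one matrix row."""
    return [v - m for v, m in zip(row, mean)]

# Two hypothetical input rows, streamed through the mapper.
rows = [[2.0, 0.0, 1.0], [4.0, 4.0, 1.0]]
out = [map_subtract(r) for r in rows]
```

With this scheme the job becomes map-only: no join key, no reducer, and each row is touched exactly once.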
> Principal Component Analysis for mahout-examples
> 
>
> Key: MAHOUT-512
> URL: https://issues.apache.org/jira/browse/MAHOUT-512
> Project: Mahout
> Issue Type: New Feature
> Reporter: Max Heimel
> Assignee: Max Heimel
> Priority: Minor
> Attachments: MAHOUT512.patch
>
> Original Estimate: 2,016h
> Remaining Estimate: 2,016h
>
> h2.Overview:
> Principal Component Analysis (PCA) is a widely used statistical method for decorrelation and dimensionality reduction. It is typically applied for denoising data, for data compression, and as a preprocessing step for more advanced statistical analysis tools (e.g. independent component analysis). The main idea of PCA is to apply a linear transformation of the given data points into a (typically lower-dimensional) feature space, such that the direction of greatest variance within the data lies along the first dimension of the feature space.
> One approach to performing PCA is to transform the d-dimensional data into the space spanned by the p largest eigenvectors of the covariance matrix of the (centered) data. This is done in the following steps (assume the data is given as an n x d data matrix D containing n data rows of dimensionality d):
> * Compute the d-dimensional empirical mean vector m of the data points.
> * Center the data by subtracting m from each data point.
> * Compute the covariance matrix C = E' * E, where E is the centered data matrix and E' is its transpose.
> * Compute the eigenvalue decomposition of C, such that e_i is the eigenvector corresponding to eigenvalue lambda_i; order the indices such that lambda_i > lambda_j implies i < j.
> * Create the transformation matrix T from the p largest eigenvectors (e_0 ... e_{p-1}) of C.
> * Transforming a data matrix into the PCA space: R = (D - M) * T, where M repeats the mean vector m in every row.
> * Transforming data from the PCA space back into the original space: D = (R * T') + M.
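The steps above can be sketched end-to-end in plain Python. This is only an illustration with a hypothetical 2-dimensional toy dataset, using the closed-form eigendecomposition of a symmetric 2x2 matrix; it is not the proposed Mahout implementation.

```python
import math

# Hypothetical toy dataset: n = 4 points, d = 2.
data = [[2.0, 0.0], [0.0, 2.0], [3.0, 1.0], [1.0, 3.0]]
n, d = len(data), len(data[0])

# Step 1: empirical mean vector m.
m = [sum(row[j] for row in data) / n for j in range(d)]

# Step 2: center the data (E = D - M).
E = [[row[j] - m[j] for j in range(d)] for row in data]

# Step 3: covariance matrix C = E' * E.
C = [[sum(E[i][a] * E[i][b] for i in range(n)) for b in range(d)] for a in range(d)]

# Step 4: eigendecomposition of the symmetric 2x2 matrix C (closed form;
# the degenerate case lambda1 == lambda2 is ignored in this sketch).
a, b, c = C[0][0], C[0][1], C[1][1]
tr, det = a + c, a * c - b * b
lam1 = tr / 2 + math.sqrt(tr * tr / 4 - det)  # largest eigenvalue
lam2 = tr / 2 - math.sqrt(tr * tr / 4 - det)

def normalize(v):
    nrm = math.sqrt(sum(x * x for x in v))
    return [x / nrm for x in v]

if abs(b) > 1e-12:
    v1 = normalize([lam1 - c, b])  # eigenvector for lam1
    v2 = normalize([lam2 - c, b])  # eigenvector for lam2
else:
    v1, v2 = [1.0, 0.0], [0.0, 1.0]  # C already diagonal

# Step 5: transformation matrix T with the eigenvectors as columns (p = d here).
T = [[v1[0], v2[0]], [v1[1], v2[1]]]

# Step 6: project into PCA space: R = (D - M) * T.
R = [[sum(E[i][k] * T[k][j] for k in range(d)) for j in range(d)] for i in range(n)]

# Step 7: reconstruct: D = R * T' + M (exact here, since p = d and T is orthogonal).
D_rec = [[sum(R[i][k] * T[j][k] for k in range(d)) + m[j] for j in range(d)]
         for i in range(n)]
```

With p = d the reconstruction recovers the original data exactly; choosing p < d drops the directions of least variance, which is where the compression and denoising come from.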
> h2.Implementation overview:
> I would suggest implementing PCA for the mahout-examples subproject. Maybe some of the jobs could also go into mahout-core/math, if wished. The implementation would consist of a set of M/R jobs and some glue code. I assume the data is given in HDFS as a file of d-dimensional data points, where each row corresponds to a data point. The MapReduce-based implementation of PCA for Mahout would then consist of the following code:
> # Distributed data centering code that computes the empirical mean vector and centers the data.
> # Distributed code for computing the covariance matrix.
> # Sequential (?) glue code for computing the eigenvalue decomposition of the covariance matrix and constructing the transformation matrices T and T'. This code would use the existing eigensolvers of Mahout.
> # Distributed code to transform data into the PCA space. This code would use the existing distributed matrix multiplication of Mahout and the data centering job.
> # Distributed code to transform PCA data back into the original space. This code would use the existing distributed matrix multiplication of Mahout and the data centering job.
> # Glue code that combines all jobs into a coherent implementation of PCA.
> h2.Implementation details:
> For a complete implementation, the following three map/reduce jobs have to be written:
> # An M/R job to compute the empirical mean vector. Each mapper gets a row (data point) and emits (dimension, [value, 1]) tuples. A local combiner pre-aggregates the values for each dimension to (dimension, [sum of values, number of data points]). Finally, a (single) reducer computes the average for each dimension and stores back the mean vector.
> # An M/R job to center (uncenter) the data. The job consists of a single map phase that reads in a row (data point) of the original matrix, subtracts (or adds) the empirical mean vector, and emits (row number, new row) pairs that are written back to disk.
> # An M/R job to compute the matrix product of a large matrix with its transpose. The principal idea: each mapper reads in a row of the matrix and computes the products of all combinations of its components (e.g. [a,b,c] -> (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc)). The key corresponds to the position within the symmetric result matrix. A local combiner sums up the local products for each key, the reducer sums up the sums of products from all mappers per key and stores (position, value) tuples to disk. Finally, a (local) cleanup phase constructs the covariance matrix from the (position, value) tuples. Since the covariance matrix is d x d and d is typically small compared to the number of data points, I assume this final local step should be fine.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
