
h1. Dimensional Reduction
Matrix algebra underpins the way many Big Data algorithms and data structures are composed. For example:
* Fulltext search can be viewed as multiplying the term-document matrix by the query vector, giving a vector over documents whose components are relevance scores (see the sketch below).
* Computing co-occurrences in a collaborative filtering context (people who viewed X also viewed Y, or ratings-based CF like the Netflix Prize contest) amounts to squaring the user-item interaction matrix.
* Finding users who are k degrees separated from each other in a social network or web graph comes down to taking the k-fold product of the graph adjacency matrix.
The list goes on, and these are all cases where the linear structure of the matrix is preserved!
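To make the first example concrete, here is a minimal sketch in plain Java (not Mahout's Vector/Matrix API; the matrix, weights, and terms are invented for illustration) of scoring documents by multiplying the transpose of a term-document matrix by a query vector:
{code:java}
// Toy illustration: fulltext relevance scoring as matrix multiplication.
// The term-document matrix and query are made-up examples, not Mahout code.
public class MatrixSearchSketch {
  public static void main(String[] args) {
    // termDoc[t][d] = weight of term t in document d (e.g. tf-idf).
    double[][] termDoc = {
        {1.0, 0.0, 2.0},   // term "lucene"
        {0.0, 1.0, 1.0},   // term "mahout"
        {1.0, 1.0, 0.0}    // term "hadoop"
    };
    double[] query = {1.0, 1.0, 0.0}; // query contains "lucene" and "mahout"

    // scores = termDoc^T * query: one relevance score per document.
    int numDocs = termDoc[0].length;
    double[] scores = new double[numDocs];
    for (int d = 0; d < numDocs; d++) {
      for (int t = 0; t < termDoc.length; t++) {
        scores[d] += termDoc[t][d] * query[t];
      }
    }
    for (int d = 0; d < numDocs; d++) {
      System.out.println("doc " + d + " score: " + scores[d]);
    }
  }
}
{code}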
Each of these examples deals with matrices which tend to be tremendously large (often millions to tens of millions to hundreds of millions of rows or more, by a sometimes comparable number of columns), but also rather sparse. Sparse matrices are nice in some respects: a dense matrix 10^7 on a side would have 10^14 (100 trillion) entries! But the sparsity is often problematic, because any given two rows (or columns) of the matrix may have zero overlap in their nonzero entries (a tiny sketch of this follows below). Additionally, any machine learning done on the data which comprises the rows has to deal with what is known as "the curse of dimensionality": for example, there are too many columns to train most regression or classification models on each of them as an independent feature.
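Here is a minimal sketch of the zero-overlap problem, using plain Java maps as stand-in sparse vectors (not Mahout's sparse Vector classes; the indices and values are invented):
{code:java}
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the zero-overlap problem with sparse rows.
public class SparseOverlapSketch {
  public static void main(String[] args) {
    // Two sparse rows stored as columnIndex -> value maps.
    Map<Integer, Double> rowA = new HashMap<>();
    rowA.put(3, 1.0);
    rowA.put(1_000_000, 2.0);
    Map<Integer, Double> rowB = new HashMap<>();
    rowB.put(7, 4.0);
    rowB.put(2_500_000, 1.0);

    // Dot product over the intersection of nonzero columns.
    double dot = 0.0;
    for (Map.Entry<Integer, Double> e : rowA.entrySet()) {
      Double b = rowB.get(e.getKey());
      if (b != null) {
        dot += e.getValue() * b;
      }
    }
    // Disjoint nonzero columns: the dot product (and hence cosine
    // similarity) is exactly 0, even if the rows are "really" related.
    System.out.println("dot(rowA, rowB) = " + dot);
  }
}
{code}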
One of the more useful approaches to dealing with such huge sparse data sets is dimensionality reduction, where a lower-dimensional subspace (or submanifold) of the original column (feature) space of your data is found or constructed, and your rows are mapped into it. In this reduced-dimensional space, the components of distance between points which are "important" are exaggerated, and the unimportant ones washed away; additionally, the sparsity of your rows is traded for drastically lower-dimensional, but dense, "signatures" (see the sketch below). While this loss of sparsity can lead to its own complications, a proper dimensionality reduction can help reveal the most important features of your data, expose correlations among your supposedly independent original variables, and smooth over the zeroes in your correlation matrix.
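The mapping itself is just a matrix-vector product. Here is a minimal sketch in plain Java (the basis vectors here are invented for illustration; in practice they would come from a decomposition like those below):
{code:java}
// Toy illustration: mapping a high-dimensional sparse row into a dense
// low-dimensional "signature" by projecting onto a k-dimensional basis.
public class ProjectionSketch {
  public static void main(String[] args) {
    int n = 6; // original (feature) dimension; in practice often millions
    int k = 2; // reduced dimension

    // basis[j] is the j-th basis vector of the reduced space (length n),
    // e.g. a right-singular vector of the data matrix.
    double[][] basis = {
        {0.5, 0.5, 0.5, 0.5, 0.0, 0.0},
        {0.0, 0.0, 0.0, 0.0, 0.7, 0.7}
    };

    // A sparse row: mostly zeroes.
    double[] row = {0.0, 3.0, 0.0, 0.0, 1.0, 0.0};

    // signature = basis * row: dense, length k.
    double[] signature = new double[k];
    for (int j = 0; j < k; j++) {
      for (int i = 0; i < n; i++) {
        signature[j] += basis[j][i] * row[i];
      }
    }
    System.out.println(java.util.Arrays.toString(signature));
  }
}
{code}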
One of the most straightforward techniques for dimensionality reduction is matrix decomposition: singular value decomposition (SVD), eigendecomposition, non-negative matrix factorization, and so on. In their truncated form, these decompositions are an excellent first approach toward linearity-preserving unsupervised feature selection and dimensional reduction. Of course, sparse matrices which don't fit in RAM need special treatment as far as decomposition is concerned: parallelizable and/or stream-oriented algorithms are needed.
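For concreteness, in LaTeX notation the rank-k truncated SVD approximates an m-by-n matrix A by keeping only its k largest singular values:
{code}
A \approx A_k = U_k \Sigma_k V_k^T
{code}
where U_k (m-by-k) and V_k (n-by-k) have orthonormal columns and \Sigma_k is the k-by-k diagonal matrix of the top singular values; a row a of A then gets the dense k-dimensional signature a V_k, which is the projection sketched above.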
h1. Singular Value Decomposition
Currently implemented in Mahout (as of 0.3, the first release with MAHOUT-180 applied) are two scalable implementations of SVD: a stream-oriented implementation using the Asymmetric Generalized Hebbian Algorithm outlined in Genevieve Gorrell and Brandyn Webb's paper, [Gorrell and Webb (2005)|http://www.dcs.shef.ac.uk/~genevieve/gorrell_webb.pdf]; and a Lanczos implementation, available both single-threaded in the o.a.m.math.decomposer.lanczos package (math module) and as a Hadoop map-reduce (series of) job(s) in the o.a.m.math.hadoop.decomposer package (core module). Coming soon: stochastic decomposition.
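To give a flavor of this style of iterative, matrix-free decomposition, here is a toy power-iteration sketch in plain Java that estimates the top right-singular vector of a small matrix by repeatedly applying A^T A. This is only an illustration of the general idea; Mahout's Lanczos and Hebbian solvers are considerably more sophisticated (orthogonalization against previous vectors, streaming updates, Hadoop parallelism, etc.):
{code:java}
import java.util.Arrays;
import java.util.Random;

// Toy power iteration: converges to the top right-singular vector of A.
// Each pass touches A row by row, so A^T A is never materialized.
public class PowerIterationSketch {
  public static void main(String[] args) {
    double[][] a = {
        {2.0, 0.0, 1.0},
        {0.0, 1.0, 0.0},
        {1.0, 0.0, 2.0}
    };
    int n = a[0].length;
    Random rng = new Random(42);
    double[] v = new double[n];
    for (int i = 0; i < n; i++) {
      v[i] = rng.nextDouble();
    }

    for (int iter = 0; iter < 100; iter++) {
      // w = A^T (A v), computed in two row-wise passes over A.
      double[] av = new double[a.length];
      for (int r = 0; r < a.length; r++) {
        for (int c = 0; c < n; c++) {
          av[r] += a[r][c] * v[c];
        }
      }
      double[] w = new double[n];
      for (int r = 0; r < a.length; r++) {
        for (int c = 0; c < n; c++) {
          w[c] += a[r][c] * av[r];
        }
      }
      // Normalize; ||w|| converges to the square of the top singular value.
      double norm = 0.0;
      for (double x : w) {
        norm += x * x;
      }
      norm = Math.sqrt(norm);
      for (int c = 0; c < n; c++) {
        v[c] = w[c] / norm;
      }
    }
    System.out.println("top right-singular vector: " + Arrays.toString(v));
  }
}
{code}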
