mahout-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
Subject [CONF] Apache Mahout > Dimensional Reduction
Date Tue, 06 Jul 2010 00:00:01 GMT
Space: Apache Mahout (
Page: Dimensional Reduction (

Edited by Grant Ingersoll:
Matrix algebra underpins the way many Big Data algorithms and data structures are composed:
full-text search can be viewed as doing matrix multiplication of the term-document matrix
by the query vector (giving a vector over documents where the components are the relevance
score), computing co-occurrences in a collaborative filtering context (people who viewed X
also viewed Y, or ratings-based CF like the Netflix Prize contest) is taking the squaring
the user-item interaction matrix, calculating users who are k-degrees separated from each
other in a social network or web-graph can be found by looking at the k-fold product of the
graph adjacency matrix, and the list goes on (and these are all cases where the linear structure
of the matrix is preserved!)

Each of these examples deal with cases of matrices which tend to be tremendously large (often
millions to tens of millions to hundreds of millions of rows or more, by sometimes a comparable
number of columns), but also rather sparse. Sparse matrices are nice in some respects: dense
matrices which are 10^7 on a side would have 100 trillion non-zero entries! But the sparsity
is often problematic, because any given two rows (or columns) of the matrix may have zero
overlap. Additionally, any machine-learning work done on the data which comprises the rows
has to deal with what is known as "the curse of dimensionality", and for example, there are
too many columns to train most regression or classification problems on them independently.

One of the more useful approaches to dealing with such huge sparse data sets is the concept
of dimensionality reduction, where a lower dimensional space of the original column (feature)
space of your data is found / constructed, and your rows are mapped into that subspace (or
sub-manifold).  In this reduced dimensional space, "important" components to distance between
points are exaggerated, and unimportant ones washed away, and additionally, sparsity of your
rows is traded for drastically reduced dimensional, but dense "signatures". While this loss
of sparsity can lead to its own complications, a proper dimensionality reduction can help
reveal the most important features of your data, expose correlations among your supposedly
independent original variables, and smooth over the zeroes in your correlation matrix.

One of the most straightforward techniques for dimensionality reduction is the matrix decomposition:
singular value decomposition, eigen decomposition, non-negative matrix factorization, etc.
In their truncated form these decompositions are an excellent first approach toward linearity
preserving unsupervised feature selection and dimensional reduction. Of course, sparse matrices
which don't fit in RAM need special treatment as far as decomposition is concerned. Parallelizable
and/or stream-oriented algorithms are needed.

h1. Singular Value Decomposition

Currently implemented in Mahout (as of 0.3, the first release with MAHOUT-180 applied), are
two scalable implementations of SVD, a stream-oriented implementation using the Asymmetric
Generalized Hebbian Algorithm outlined in Genevieve Gorrell & Brandyn Webb's paper ([Gorrell
and Webb 2005|]); and there is a [Lanczos
|] implementation, both single-threaded, and
in the o.a.m.math.decomposer.lanczos package (math module), as a hadoop map-reduce (series
of) job(s) in o.a.m.math.hadoop.decomposer package (core module). Coming soon: stochastic

h2. Lanczos

The Lanczos algorithm is designed for eigen-decomposition, but like any such algorithm, getting
singular vectors out of it is immediate (singular vectors of matrix A are just the eigenvectors
of A^t * A or A * A^t).  Lanczos works by taking a starting seed vector *v* (with cardinality
equal to the number of columns of the matrix A), and repeatedly multiplying A by the result:
*v'* = A.times(*v*) (and then subtracting off what is proportional to previous *v'*'s, and
building up an auxiliary matrix of projections).  In the case where A is not square (in general:
not symmetric), then you actually want to repeatedly multiply A*A^t by *v*: *v'* = (A * A^t).times(*v*),
or equivalently, in Mahout, A.timesSquared(*v*) (timesSquared is merely an optimization: by
changing the order of summation in A*A^t.times(*v*), you can do the same computation as one
pass over the rows of A instead of two).

After *k* iterations of *v_i* = A.timesSquared(*v_(i-1)*), a *k*-by-*k* tridiagonal matrix
has been created (the auxiliary matrix mentioned above), out of which a good (often extremely
good) approximation to *k* of the singular values (and with the basis spanned by the *v_i*,
the *k* singular *vectors* may also be extracted) of A may be efficiently extracted.  Which
*k*?  It's actually a spread across the entire spectrum: the first few will most certainly
be the largest singular values, and the bottom few will be the smallest, but you have no guarantee
that just because you have the n'th largest singular value of A, that you also have the (n-1)'st
as well.  A good rule of thumb is to try and extract out the top 3k singular vectors via Lanczos,
and then discard the bottom two thirds, if you want primarily the largest singular values
(which is the case for using Lanczos for dimensional reduction).

h3. Parallelization Stragegy

Lanczos is "embarassingly parallelizable": matrix multiplication of a matrix by a vector may
be carried out row-at-a-time without communication until at the end, the results of the intermediate
matrix-by-vector outputs are accumulated on one final vector.  When it's truly A.times(*v*),
the final accumulation doesn't even have collision / synchronization issues (the outputs are
individual separate entries on a single vector), and multicore approaches can be very fast,
and there should also be tricks to speed things up on Hadoop.  In the asymmetric case, where
the operation is A.timesSquared(*v*), the accumulation does require synchronization (the vectors
to be summed have nonzero elements all across their range), but delaying writing to disk until
Mapper close(), and remembering that having a Combiner be the same as the Reducer, the bottleneck
in accumulation is nowhere near a single point.

h3. Mahout usage

Coming soon...

Short form: 
  <MAHOUT_HOME>/bin/mahout svd \
-Dmapred.input.dir=path/to/corpus --tempDir path/for/svd-output --rank 300 --numColumns <numcols>
--numRows <num rows in the input>

path/to/corpus is the location on HDFS where a SequenceFile<Writable,VectorWritable>
(preferably SequentialAccessSparseVectors instances) lies which you wish to decompose.  Each
vector of which has numcols entries.  numRows is used to properly size the matrix data structures.

Currently, also necessary is some "cleaning" of the eigen/singular output:

  <MAHOUT_HOME>/bin/mahout cleansvd \
--eigenInput path/for/svd-output --corpusInput path/to/corpus --output path/for/cleanOutput
--maxError 0.1 --minEigenvalue 10.0 

Where corpusInput is the corpus from the previous step, eigenInput is the output from the
previous step, and you have a new output path for clean output.  The two "cleaning" params
are maxError - the maximum allowed 1-cosAngle(v, A.timesSquared(v)), and minEigenvalue.  Eigenvectors
which have too large error, or too small eigenvalue are discarded.  Optional argument: --inMemory,
if you have enough memory on your local machine (not on the hadoop cluster nodes!) to load
all eigenvectors into memory at once (at least 8 bytes/double * rank * numCols), then you
have some speedups on this cleaning process.

TODO: also allow exclusion based on improper orthogonality (currently computed, but not checked
against constraints).

TODO: more details.

h1. Resources


Change your notification preferences:

View raw message