mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Doing dimensionality reduction with SSVD and Lanczos
Date Thu, 06 Sep 2012 17:17:46 GMT
When using Lanczos, the recommendation is to use the clean eigenvectors as a distributed row
matrix (call it V).

A-hat = A^t V^t, per the clusterdump tests DSVD and DSVD2.

Dmitriy and Ted recommend when using SSVD to do:

A-hat = US 

When using PCA it's also preferable to use --uHalfSigma to create U with the SSVD solver.
One difficulty is that to perform the multiplication you have to turn the singular values
vector (diagonal values) into a distributed row matrix or write your own multiply function,
correct?
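
As a rough illustration of the "write your own multiply" option: multiplying by a diagonal matrix only scales each column of U by the matching singular value, so the diagonal never has to be materialized. The sketch below is plain Python, not Mahout API. The function name, the `power` parameter, and the toy data are all hypothetical, with a dict standing in for rows keyed by document id.

```python
def scale_columns(u_rows, sigma, power=1.0):
    """Compute U * diag(sigma ** power) row by row, without building the diagonal matrix.

    u_rows: dict mapping a row key (e.g. a document id) to that row of U.
    sigma:  list of singular values, one per column of U.
    power:  1.0 gives U*Sigma, 0.5 gives U*Sigma^0.5.
    """
    weights = [s ** power for s in sigma]
    # Each output row keeps its original key, so document ids are preserved.
    return {key: [u * w for u, w in zip(row, weights)] for key, row in u_rows.items()}


# Toy values chosen for illustration only.
u_rows = {"doc1": [0.5, -0.25], "doc2": [0.75, 0.5]}
sigma = [4.0, 2.0]

us_rows = scale_columns(u_rows, sigma)
```

Because the scaling is applied independently to each row, the same logic would fit a map-only pass over a row-keyed matrix, with the singular values shipped to every mapper.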

Questions:
For SSVD, can someone explain why US is preferred? Given A = USV^t, how can you ignore the
effect of V^t? Is this only for PCA? In other words, if you did not use PCA weighting, would
you ignore V^t?
For Lanczos, A-hat = A^t V^t seems to strip the doc ids during the transpose; am I mistaken?
Also, shouldn't A-hat be transposed before performing kmeans or other analysis?



> Dmitriy said
With SSVD you need just US (or U*Sigma in other notation).
This is the dimensionally reduced output of the original document
matrix that you ran with the --pca option.
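
One way to see why V^t does not need to be applied separately: if A = U S V^t and V has orthonormal columns, then A V = U S V^t V = U S. So US is already A projected onto the basis V. The sketch below checks that identity in plain Python on hand-picked toy factors. The values and helper names are illustrative only, and it ignores the mean-centering that a PCA run would add.

```python
def matmul(x, y):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    """Return the transpose of a dense matrix given as a list of rows."""
    return [list(col) for col in zip(*x)]


# Hand-picked factors with orthonormal columns (toy values, not from the thread).
U = [[1.0, 0.0], [0.0, 0.6], [0.0, 0.8]]   # 3 documents x 2 components
S = [[5.0, 0.0], [0.0, 2.0]]               # singular values on the diagonal
V = [[0.6, -0.8], [0.8, 0.6]]              # 2 terms x 2 components

A = matmul(matmul(U, S), transpose(V))     # A = U S V^t
US = matmul(U, S)                          # reduced representation, one row per document
AV = matmul(A, V)                          # A projected onto the basis V

# Largest element-wise difference between A V and U S (should be rounding noise).
max_diff = max(abs(p - q) for r1, r2 in zip(AV, US) for p, q in zip(r1, r2))
```

Each row of US corresponds to the same document as the matching row of A, which is why the row keys carry over unchanged.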

As Ted suggests, you may also use US^0.5, which is already produced by
providing --uHalfSigma (or its embedded setter analog). The keys of
that output (produced by the getUPath() call) will already contain your
Text document ids as sequence file keys.

-d


